Abstract


This thesis explores the power of interactivity in unsupervised machine learning
problems. Interactive algorithms employ feedback-driven measurements to reduce
data acquisition costs and consequently enable statistical analysis in otherwise
intractable settings. Unsupervised learning methods are fundamental tools across a
variety of domains, and interactive procedures promise to broaden the scope of
statistical analysis. We develop interactive learning algorithms for three unsupervised
problems: subspace learning, clustering, and tree metric learning. Our theoretical
and empirical analysis shows that interactivity can bring both statistical and
computational improvements over non-interactive approaches. An overarching thread
of this thesis is that interactive learning is particularly powerful for non-uniform
datasets, where non-uniformity is quantified differently in each setting.

We  first  study  the  subspace  learning  problem,  where  the  goal  is  to  recover  or 
approximate  the  principal  subspace  of  a  collection  of  partially  observed  data  points. 
We  propose  statistically  and  computationally  appealing  interactive  algorithms  for 
both  the  matrix  completion  problem,  where  the  data  points  lie  on  a  low  dimensional 
subspace,  and  the  matrix  approximation  problem,  where  one  must  approximate  the 
principal  components  of  a  collection  of  points.  We  measure  uniformity  with  the 
notion  of  incoherence,  and  we  show  that  our  feedback-driven  algorithms  perform 
well  under  much  milder  incoherence  assumptions. 

We next consider clustering a dataset represented by a partially observed
similarity matrix. We propose an interactive procedure for recovering a clustering from a
small number of carefully selected similarity measurements. The algorithm exploits
non-uniformity of cluster size, using few measurements to recover larger clusters and
focusing measurements on the smaller structures. In addition to coming with strong
statistical and computational guarantees, this algorithm performs well in practice.

We also consider a specific metric learning problem, where we compute a latent
tree metric to approximate distances over a point set. This problem is motivated by
applications in network tomography, where the goal is to approximate the network
structure using only measurements between pairs of end hosts. Our algorithms use
an interactively chosen subset of the pairwise distances to learn the latent tree metric
while being robust to either additive noise or a small number of arbitrarily corrupted
distances. As before, we leverage non-uniformity inherent in the tree metric structure
to achieve low sample complexity.

Finally, we study a classical hypothesis testing problem where we focus on showing
fundamental limits for non-interactive approaches. Our main result is a precise
characterization of the performance of non-interactive approaches, which shows that, on
particular problems, all non-interactive approaches are statistically weaker than a
simple interactive one. These results bolster the theme that interactivity can bring
about statistical improvements in unsupervised problems.
Chapter 1


Introduction 


Interactive learning is a framework for statistical analysis in which the inference procedure
interacts with the data acquisition mechanism to make feedback-driven measurements. This
framework, which is also referred to as active learning, adaptive sampling, or adaptive sensing, has
become increasingly popular in recent years, as it often significantly reduces the overhead associated
with data collection. On both theoretical and empirical fronts, interactive learning has been
successfully applied to a variety of supervised machine learning [19, 21, 22, , , , , 65, 67, 1 , , ]
and signal processing problems [ , , 24, , 161]. However, interactive
approaches have not experienced the same degree of success for unsupervised learning,
and our understanding in this area is quite limited. This thesis addresses this deficiency with an
exploration of the power of interactive approaches for unsupervised learning.

Unsupervised learning refers to a broad class of learning problems where the dataset is not
endowed with label information and the explicit goal is to identify some structural characteristics
of the data. This contrasts with supervised problems where data points are associated with
labels, and the goal is to learn an accurate mapping from data points to their labels. Examples of
unsupervised learning range from clustering and manifold learning, where the goal is to capture
locality information, to hypothesis testing, where the goal is to understand the data-generating
process more generically. Unsupervised learning plays an important role in exploratory data
analysis, as it provides the statistician with some basic understanding of the dataset.

Unfortunately,  unsupervised  learning  tasks,  formulations,  and  algorithms  are  extremely  varied, 
making  a  unified  treatment  challenging.  Our  study  of  interactive  approaches  for  unsupervised 
learning  therefore  focuses  on  several  important  and  representative  examples  rather  than  a  general 
treatment.  Our  choices  of  examples  are  motivated  by  two  considerations:  the  learning  problem 
should  be  widely  studied  and  practically  relevant,  and  there  should  be  concrete  applications 
where  an  interactive  approach  is  feasible.  Our  experience  is  that  ideas  in  the  development  of 
these  examples  will  be  applicable  in  other  unsupervised  learning  problems. 

Through these examples, we show that interactive learning offers three distinct advantages. First,
interactive algorithms have lower sample requirements than non-interactive ones, and are
therefore statistically more efficient. Secondly, interactive approaches are particularly powerful when
the data exhibits high degrees of non-uniformity, as the sampling mechanism can focus
measurements to accurately capture these aspects of the data. Lastly, interactivity offers a
computational improvement as these algorithms are often both theoretically and empirically faster than
non-interactive ones. These claims are supported by the several examples in this thesis. More
formally, our thesis statement is:

Thesis statement: Interactive data acquisition facilitates statistically and computationally
efficient unsupervised learning algorithms that are particularly well-suited to handle non-uniform
datasets.

In  the  remainder  of  this  chapter,  we  describe  these  three  advantages  in  some  more  detail  and  then 
turn  to  an  overview  of  the  main  results.  We  conclude  this  chapter  with  a  broad  discussion  of 
related  work  on  interactive  learning. 


1.1  Overarching  Themes 


In the context of unsupervised learning, we claim that interactive approaches offer three distinct
advantages over non-interactive ones. These are:

1. Statistical efficiency: The main appeal of interactive learning is statistical efficiency.
Intuitively, by incorporating feedback into the measurement process, an interactive algorithm
should be able to achieve suitable statistical performance with fewer measurements than
a non-interactive one. Interactive learning is indeed a strictly more powerful model, yet
there are many documented examples where interactivity is known not to provide
significant statistical improvements over non-interactive approaches [ 1, 63, 1 14]. In this thesis,
we study a number of unsupervised learning problems and show that interactivity does in fact
lead to significantly improved statistical performance.

In the machine learning community, statistical efficiency is usually quantified by sample
complexity, which is the number of samples required to achieve a certain accuracy in a
learning task. In the signal processing literature, a signal-to-noise ratio, which measures
the problem difficulty, is more commonly used. We use both notions in this thesis,
depending on the problem of study, but make fair comparisons to other approaches throughout.

2. Computational efficiency: Given the increasing size and complexity of data sets,
computational efficiency is an important consideration when designing learning algorithms. In
addition to statistical efficiency, we also argue that interactive approaches can be
computationally more efficient than non-interactive ones, particularly in unsupervised settings.
This claim is challenging to argue theoretically, as it requires establishing a computational
lower bound on non-interactive algorithms, and proving such lower bounds is notoriously
hard. We instead compare our algorithms against non-interactive ones, both theoretically,
in their asymptotic running times, and empirically, via extensive simulation.

We find it surprising that interactive approaches actually lead to computational
improvements over non-interactive ones, as many algorithms for interactive supervised learning do
not demonstrate this phenomenon [ , , ]. One exception is the algorithm due to
Beygelzimer et al. [ ], which is often faster than passive learning in practice, but reduces
active learning to a possibly NP-hard zero-one loss empirical risk minimization problem.
One reason that the supervised algorithms rarely show computational gains is that they perform
sophisticated computations to select future measurements, while we find that, in the
unsupervised problems considered here, much simpler sampling techniques suffice. These
simple sampling approaches, along with the fact that interactive algorithms can ignore large
fractions of the dataset, lead to the computational improvements demonstrated in this thesis.

3. Coping with non-uniformity: Lastly, we find that interactive learning algorithms are
particularly well-suited to data sets with high degrees of non-uniformity. While
non-uniformity is quantified differently in each of the examples considered in this thesis, our
algorithms can quickly identify these non-uniformities and focus measurements to
accurately capture these aspects of the data. On the other hand, non-interactive approaches
have high sample complexities for these non-uniform problems, as one needs many
measurements in certain regions to achieve suitable accuracy. Formalizing this argument, we
show that interactive approaches have significantly better statistical performance than
non-interactive ones on these non-uniform problems.


1.2  Overview  of  Results 


In this thesis we study four unsupervised learning problems and develop interactive learning
algorithms for these problems. The first three problems can all be formalized as matrix inference
problems: given feedback-driven access to the entries of a $d \times n$ matrix $X$, which may be corrupted
with noise, we are interested in recovering some structural property of the matrix. We propose
interactive algorithms to recover three different structural properties and compare against
non-interactive approaches, ones that either observe the entire matrix or a subset of entries acquired
prior to any computation. In all three settings, we show that our interactive algorithms can
significantly outperform non-interactive ones, in line with the overarching themes of this thesis.


1.2.1  Interactive  Subspace  Learning 

In the subspace learning problem, the data matrix $X$ corresponds to a collection of $n$ points in
$d$ dimensions, and the goal is to recover a subspace of $\mathbb{R}^d$ that effectively captures the dataset.
When the data matrix is fully observed, it is well known that principal components analysis
(PCA) identifies a subspace that optimally approximates the data matrix [ ]. In the missing
data setting that we consider here, this is referred to as the matrix completion or the matrix
approximation problem [ , 72, 75, 97, 1(T , ].

In Chapter 2, we study three versions of the subspace learning problem and propose novel
algorithms that employ interactive sampling to obtain strong performance guarantees. We first
consider the setting where the data points lie exactly on an $r$-dimensional subspace, which is
referred to as the (noiseless) matrix completion problem. Our algorithm interactively identifies
entries that are highly informative for learning the column space of the matrix and, consequently,
it succeeds even when the row space is highly non-uniform (according to a standard definition of
non-uniformity), in contrast with non-interactive approaches. We show that one can exactly
recover a $d \times n$ matrix of rank $r$ from merely $\Omega((d + n) r \log^2(r))$ matrix entries using an algorithm
with running time that is linear in the matrix size, $\max\{d, n\}$, with a mild polynomial
dependence on the rank $r$. In addition to significantly relaxing uniformity assumptions, this algorithm
nearly matches the best known sample complexity and is the fastest known algorithm for matrix
completion.
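
To make the idea of column-space-driven sampling concrete, here is a minimal sketch of one way such a procedure could look in the noiseless setting. It is an illustration under stated assumptions, not the algorithm analyzed in Chapter 2: the function name `adaptive_complete`, the per-column probe count `m` (which must exceed the rank), and the tolerance `tol` are all hypothetical choices. Each column is probed at a few random rows; if the probed entries are not explained by the column space found so far, the column is read in full and added to the basis, and otherwise it is filled in by least squares.

```python
import numpy as np

def adaptive_complete(X, m, rng, tol=1e-8):
    """Sketch of column-space-driven interactive completion (noiseless case).

    X is treated as an entry oracle; `observed` counts how many entries are read.
    m is the number of rows probed per column and must exceed the rank (assumption).
    """
    d, n = X.shape
    U = np.zeros((d, 0))      # orthonormal basis of the column space found so far
    X_hat = np.zeros((d, n))
    observed = 0
    for j in range(n):
        rows = rng.choice(d, size=m, replace=False)
        y = X[rows, j]        # probe m entries of column j
        observed += m
        if U.shape[1] == 0:
            coef, residual = np.zeros(0), np.linalg.norm(y)
        else:
            # Least-squares fit of the probed entries using the current basis
            coef, *_ = np.linalg.lstsq(U[rows, :], y, rcond=None)
            residual = np.linalg.norm(y - U[rows, :] @ coef)
        if residual > tol * max(1.0, np.linalg.norm(y)):
            col = X[:, j]                     # not explained: read the whole column
            observed += d - m
            new_dir = col - U @ (U.T @ col)   # part of the column outside the current span
            U = np.hstack([U, (new_dir / np.linalg.norm(new_dir))[:, None]])
            X_hat[:, j] = col
        else:
            X_hat[:, j] = U @ coef            # explained: fill in from the fitted coefficients
    return X_hat, observed

rng = np.random.default_rng(0)
d, n, r = 30, 40, 3
X = rng.standard_normal((d, r)) @ rng.standard_normal((r, n))
X_hat, observed = adaptive_complete(X, m=8, rng=rng)
print(observed, "of", d * n, "entries read")
```

Note that the sketch never makes an assumption about how the columns are arranged: a column that departs from the current subspace is simply read in full, which is the sense in which feedback lets the sampler tolerate a non-uniform row space.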

We generalize this algorithm to the tensor completion problem, where the data is instead a
low-rank tensor. We show that a recursive application of our matrix completion algorithm recovers a
rank $r$ order $T$ tensor $X \in \mathbb{R}^{d_1 \times \cdots \times d_T}$ using $\Omega(r^{T-1} T \sum_{j} d_j \log^2(r))$ tensor entries, which is the
best known sample complexity for this problem [ , ]. As with the algorithm for the matrix
case, this algorithm relaxes uniformity assumptions and is extremely fast.

Lastly, we consider the problem of constructing a low rank approximation to a high-rank input
matrix from interactively sampled matrix entries. This is referred to as the matrix
approximation problem. We propose a simple algorithm that truncates the singular value decomposition of
a zero-filled version of the input matrix. The algorithm computes an approximation that is nearly
as good as the best rank-$r$ approximation to the input matrix using $O(n r \mu \log^2(n))$ samples,
where $\mu$ is a uniformity parameter on the matrix columns. We eliminate uniformity assumptions
on the row space of the matrix while achieving similar statistical and computational performance
to non-interactive methods.
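
The "truncate the SVD of a zero-filled matrix" step can be sketched in a few lines. The sketch below is a simplification and makes assumptions that are not from the thesis: it replaces the interactive sampling phase with uniform sampling at rate `p`, and rescales by `1/p` so that the zero-filled matrix is an unbiased estimate of the input. The function name `zero_filled_rank_r` is hypothetical.

```python
import numpy as np

def zero_filled_rank_r(X, p, r, rng):
    """Rank-r approximation from entries observed independently with probability p."""
    mask = rng.random(X.shape) < p           # True where an entry is observed
    Z = np.where(mask, X, 0.0) / p           # zero-fill missing entries, rescale so E[Z] = X
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]    # keep the top r singular triples

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4)) @ rng.standard_normal((4, 60))
X_r = zero_filled_rank_r(X, p=0.5, r=4, rng=rng)
rel_err = np.linalg.norm(X - X_r) / np.linalg.norm(X)
print(f"relative Frobenius error at p=0.5: {rel_err:.3f}")
```

With uniform sampling the quality of this estimate depends on how evenly the energy of the matrix is spread; the interactive sampling phase described in the text is what removes the corresponding assumption on the row space.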

We  demonstrate  the  statistical  and  computational  efficiency  of  all  three  of  these  procedures  with 
extensive  empirical  evaluation.  These  results  appear  in  the  papers  [121,  122]. 


1.2.2  Interactive  Hierarchical  Clustering 

We consider a similarity-based clustering formulation where we are given an $n \times n$ symmetric
matrix $X$ of pairwise similarities between $n$ objects. In flat clustering problems the goal is
to identify a partitioning of the objects so that pairs of objects in the same group have high
similarity and pairs of objects in different groups have low similarity. In hierarchical clustering
problems, the goal is to identify this partitioning structure at multiple resolutions. We aim to
recover hierarchical cluster structures when the similarity matrix $X$ is only partially observed.

In Chapter 3, we propose interactive learning algorithms for hierarchical clustering from partially
observed pairwise similarity information. Our algorithm runs spectral clustering on a subsampled
version of the similarity matrix to resolve the larger cluster structure and then focuses
measurements to resolve the finer partitions. We show that this algorithm recovers all clusters of size
$\Omega(\log n)$ using $O(n \log^2 n)$ similarities and runs in $O(n \log^3 n)$ time for a dataset of $n$ objects.
In comparison, hierarchical spectral clustering on the fully observed similarity matrix achieves
the same resolution but uses all $O(n^2)$ similarities and runs in $O(n^2)$ time [ 16]. This algorithm
is most effective when trying to recover both the large clusters at the top of the hierarchy and the
small clusters at the bottom of the hierarchy, or, in other words, when the cluster structure is
highly non-uniform.
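
To give a rough sense of the gap between the two measurement counts, the following snippet evaluates $n \log^2 n$ and $n^2$ for a few dataset sizes. It is only an order-of-magnitude illustration: all constants hidden by the $O(\cdot)$ notation are set to 1 and the logarithm is taken to be natural, both of which are assumptions.

```python
import math

def measurement_counts(n):
    """Return (n log^2 n, n^2) with all constants set to 1 (an assumption)."""
    interactive = n * math.log(n) ** 2
    full = n ** 2
    return interactive, full

for n in (10**3, 10**4, 10**5):
    interactive, full = measurement_counts(n)
    print(f"n={n:>6}: n log^2 n ~ {interactive:.2e}, n^2 = {full:.2e}, "
          f"ratio ~ {full / interactive:.0f}")
```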

We complement this algorithmic result with an information-theoretic study of the hierarchical
clustering problem. The most important result in this study is a necessary condition for any
non-interactive algorithm to recover a hierarchical clustering. Comparing this necessary condition
with the sufficient condition developed by our interactive algorithm, we mathematically certify
the statistical advantage offered by interactivity.

We evaluate this algorithm with a detailed empirical study on simulated and real clustering data
sets. We compare with several popular clustering algorithms and show that our proposed
algorithm does lead to statistical and/or computational improvements in many cases. This algorithm
and its analysis appear in the paper [ ]. The information-theoretic study is new here.


1.2.3  Interactive  Latent  Tree  Metric  Learning 

In metric learning problems, $X \in \mathbb{R}^{n \times n}$ is a distance matrix between $n$ points, so that the $(i, j)$th
entry is the distance between the $i$th and $j$th objects. Broadly, the goal is to impute distances
between points, and this is typically done by embedding the points into some structured metric
space. In the instantiation of this problem that we study, we aim to recover a latent tree metric,
which associates each object to a leaf of some weighted tree and approximates distances between
objects via the distance along the tree. This problem is motivated by research in communication
networks showing that packet latencies can be well-approximated by latent tree metrics.
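
As a small illustration of what a latent tree metric is, the sketch below computes path-length distances between the leaves of a weighted tree and checks the classical four-point condition, which every tree metric satisfies. The tree, the node names (`h1` and `h2` stand for hidden internal nodes), and the function names are hypothetical, and the code only evaluates a known tree; it does not learn one from measurements.

```python
from collections import defaultdict

def tree_distances(edges, leaves):
    """Path-length distances between the given leaves of a weighted tree."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    def from_source(src):
        dist = {src: 0.0}
        stack = [src]
        while stack:                      # depth-first walk; paths in a tree are unique
            node = stack.pop()
            for nbr, w in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + w
                    stack.append(nbr)
        return dist

    result = {}
    for a in leaves:
        dist = from_source(a)
        result[a] = {b: dist[b] for b in leaves}
    return result

def satisfies_four_point(D, w, x, y, z, tol=1e-9):
    """Four-point condition: the two largest of the three pairwise sums are equal."""
    sums = sorted([D[w][x] + D[y][z], D[w][y] + D[x][z], D[w][z] + D[x][y]])
    return abs(sums[2] - sums[1]) <= tol

# Leaves a, b hang off hidden node h1; leaves c, d hang off hidden node h2.
edges = [("a", "h1", 1.0), ("b", "h1", 2.0), ("h1", "h2", 3.0),
         ("c", "h2", 1.5), ("d", "h2", 0.5)]
D = tree_distances(edges, ["a", "b", "c", "d"])
print(D["a"]["c"], satisfies_four_point(D, "a", "b", "c", "d"))
```

Only the leaf-to-leaf distances in `D` would be observable in the network setting; the internal nodes and edge weights are the latent structure to be recovered.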

In Chapter 4, we present two algorithms that use interactively sampled pairwise distance
measurements to construct a latent tree whose path distances approximate those between the objects.
Our first algorithm accommodates measurements perturbed by additive noise, while our second
considers a novel noise model that captures missing measurements and the dataset's deviations
from a tree topology. Both algorithms provably use $O(n \operatorname{polylog} n)$ pairwise measurements to
construct a tree approximation on $n$ end hosts and run in nearly linear time. We present simulated
and real-world experiments to evaluate both algorithms. These results appear in the paper [ ].


1.2.4  Passive  and  Interactive  Sampling  in  Normal  Means  Inference 

The last problem we consider does not fall into the matrix inference framework. We study a
structured hypothesis testing problem where the goal is to use data generated from a Gaussian
distribution to identify which vector, out of a finite collection, is the mean vector. We consider
algorithms that are given a sensing budget and asked to allocate measurements across the
coordinates, where interactive algorithms can make this allocation in a feedback-driven manner.

Our focus is on understanding how structural assumptions about the collection of mean
vectors affect statistical performance, and most of the results pertain to non-interactive approaches.
Specifically, for any non-interactive allocation strategy, we give necessary and sufficient
conditions under which the identification of the mean vector is possible. We show, through many
concrete examples, that this analysis leads to optimal non-interactive allocation strategies and
inference procedures. We also give a concrete example where a simple interactive procedure
significantly outperforms all non-interactive ones.

In  this  chapter,  we  also  initiate  a  deeper  investigation  into  the  design  of  optimal  estimators.  In 
this  direction,  we  give  a  sufficient  condition,  which  depends  on  the  structure  of  the  collection  of 
vectors,  for  the  exact  optimality  of  the  maximum  likelihood  estimator.  We  also  design  a  heuristic 
algorithm  for  improving  on  the  maximum  likelihood  estimator  in  the  cases  when  it  is  suboptimal. 
We  provide  synthetic  examples  demonstrating  the  importance  of  this  improvement. 
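
The setup above can be illustrated with a small non-interactive example: fix an allocation of the sensing budget across coordinates, observe each coordinate with noise, and pick the candidate mean by maximum likelihood. The measurement model used here is an assumption made for illustration (spending budget `b` on a coordinate yields a Gaussian observation with variance `1/b`), and the names `observe` and `mle_index` are hypothetical.

```python
import random

def observe(mean, allocation, rng):
    """Noisy look at each coordinate; spending b > 0 on a coordinate gives variance 1/b."""
    return [rng.gauss(mu, b ** -0.5) if b > 0 else None
            for mu, b in zip(mean, allocation)]

def mle_index(candidates, allocation, y):
    """Index of the candidate mean with the smallest precision-weighted squared error."""
    def cost(v):
        return sum(b * (yj - vj) ** 2
                   for vj, b, yj in zip(v, allocation, y) if b > 0)
    return min(range(len(candidates)), key=lambda i: cost(candidates[i]))

candidates = [[3.0, 0.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0]]
# No candidate differs on the last coordinate, so this allocation spends nothing there.
allocation = [4.0, 4.0, 4.0, 0.0]
rng = random.Random(0)
y = observe(candidates[1], allocation, rng)
guess = mle_index(candidates, allocation, y)
print("observations:", y, "estimated index:", guess)
```

Under this Gaussian model the maximum likelihood estimate is exactly the minimizer of the weighted squared error, and the example allocation shows how knowledge of the collection's structure can guide where a non-interactive procedure spends its budget.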


1.3  Related  Work 


In  this  section  we  provide  a  broad  summary  of  related  work  on  interactive  learning.  Research 
on  interactive  learning  is  extremely  diverse,  in  part  due  to  the  intuitive  appeal  of  the  learning 
paradigm,  and  we  cannot  hope  to  cover  all  of  the  work  here.  Instead  we  focus  attention  on  the 
theoretical  results. 

We  categorize  the  research  based  on  types  of  learning  problems  addressed: 

1. Classification and Regression: When focusing on classification or regression problems,
interactive approaches are typically referred to as active learning [57]. In active learning,
the learner interacts with the dataset by querying for the response or label of data points.
There are three ways of realizing this interaction: pool-based [62], where the learner has
access to a large number of unlabeled examples; stream-based [57, 91], where unlabeled
examples are fed one-by-one to the learner, which decides whether to query for each label; and query
synthesis [8, 9], where the learner can construct examples to be labeled. Most of the recent
attention has focused on either pool-based or stream-based active learning, as the third
model is fairly unnatural.

The literature on active learning alone is quite vast, but can roughly be categorized along
several axes. In the context of binary classification, researchers have considered hypothesis
classes ranging from linear separators through the origin [ , 66, ] to classes of bounded
Vapnik-Chervonenkis dimension [ , ]. The choice of noise model also plays a role,
with choices ranging from noise-free or realizable [6 , 63, 6< , ] to parameterized noise
models [ , ], to the most general agnostic case [2, , ]. Lastly, apart from these
works, there is a sequence of papers on Bayesian active learning where a prior
distribution is placed on the true hypothesis, and query decisions are made through computations
involving the posterior [96, 1 ].

We refer the reader to Hanneke's comprehensive treatment of the theoretical issues in
active learning [ 105]. For a more applied perspective, many algorithmic techniques for active
learning are outlined in the survey by Settles [153].

2. Sequential Decision Making: In this class of problems, a learner makes a series of
actions, possibly based on situational context, and receives reward based on the quality of
the choices made, possibly depending on context. The goal broadly is to obtain large
rewards, which amounts to learning how to choose high-quality actions. These problems fall
into the interactive learning framework because the actions that a learner makes influence
the reward feedback provided, and also possibly influence the future situations.
Therefore a learner must trade off between choosing actions that provide information about the
environment and those that provide large rewards.

The simplest version of the sequential decision making problem is the multi-arm bandit
problem. In this problem, there is a fixed set of actions and no situational context, so the
goal reduces to identifying the best fixed action. An excellent survey of results in this line
of research is provided by Bubeck and Cesa-Bianchi [38].

Incorporating situational information into the multi-arm bandit framework yields the
contextual bandit problem. Here the goal amounts to finding a policy that maps contexts
into actions while achieving high levels of reward. A number of recent algorithms address
both parametrized [55, 89, 1 ], where the reward for an action can be reliably predicted
based on some features, and agnostic [ , , , 78, ], where no features are available,
versions of this problem.

Lastly, the most challenging version of the sequential decision making problem is
reinforcement learning, where the actions of the learner affect not only the reward and
feedback, but also the future situation or context. In some models for this problem, we know
of algorithms that achieve nearly optimal statistical performance [ ]. An overview of the
main techniques for reinforcement learning problems is provided by Sutton and Barto [ !6].

3. Unsupervised Learning: There is also a plethora of results on interactive learning for
unsupervised problems. The majority of these results stem from the statistics and signal
processing communities and focus on various forms of hypothesis testing problems. Some
more recent results from the machine learning community address more classical
unsupervised problems such as clustering and subspace learning.

In the statistics literature, interactive learning is typically referred to as sequential
experimental design and includes the seminal works of Wald [169], Chernoff [5 ], and
Robbins [ 16]. The techniques are strikingly similar to those for the multi-arm bandit problem,
and indeed both lines of research stem from the initial works of Robbins and Lai [ 25].

In the signal processing community, interactive learning is typically referred to as adaptive
sensing, and the typical goal is multiple hypothesis testing from repeated direct or
compressive measurements. When individual hypotheses can be queried, the distilled sensing
algorithm [10' ] is known to outperform non-adaptive sampling schemes, and this work has
been extended to some structured settings [161]. When compressive measurements can be
taken, results under specific structural constraints are known [17, ], and unstructured
lower bounds show that significant performance improvements over non-adaptive
procedures are not possible [11].

Interactive approaches have also been considered for more classical unsupervised
learning problems, with most of the focus on clustering and kernel learning. A number of
algorithms have been proposed for interactive clustering both in hierarchical and flat
settings [15, 18, ?, ], although many of these approaches consider interactive supervision
in the form of constraints on the clustering rather than interactivity with object features or
similarities as we do. The advent of crowdsourcing platforms has also led to research on
learning via interaction with crowds of workers [ ]. Lastly, interactive learning is the de
facto standard for problems in network tomography, including topology identification [84 ],
topology-aware clustering [56], and other tasks [ 15].
Chapter 2


Interactive  Matrix  Completion 


In this chapter, we develop interactive algorithms for low rank matrix and tensor completion and
matrix approximation. In the completion problem, we would like to exactly recover a low rank
matrix (or tensor) after observing only a small fraction of its entries. In the approximation
problem, rather than exact recovery, we aim to find a low rank matrix that approximates, in a precise
sense, the input matrix, which need not be low rank. In both problems, we are only allowed to
observe a small number of matrix entries, although these entries can be chosen sequentially and
in a feedback-driven manner.

The measure of uniformity in this chapter is the notion of incoherence, which pervades the
matrix completion literature. We show that interactive sampling allows us to significantly relax the
incoherence assumption. Previous analyses show that if the energy of the matrix is spread out
fairly uniformly across its coordinates, then uniform-at-random samples suffice for completion or
approximation. In contrast, our work shows that interactive sampling algorithms can focus
measurements appropriately to solve these problems even if the energy is non-uniformly distributed.
Handling non-uniformity is essential in a variety of problems involving outliers, for example
network monitoring problems with anomalous hosts, or recommendation problems with popular
items. This is a setting where non-interactive algorithms fail, as we will show.
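
For readers unfamiliar with incoherence, the snippet below computes the subspace coherence that is standard in the matrix completion literature: for an $r$-dimensional subspace of $\mathbb{R}^n$ with orthonormal basis $u_1, \dots, u_r$, it is $(n/r) \max_i \sum_k u_k(i)^2$, which ranges from 1 (energy spread evenly) to $n/r$ (energy concentrated on a few coordinates). This excerpt does not restate the thesis's exact definition, so treating it as this standard quantity is an assumption, and the function name `coherence` is hypothetical.

```python
def coherence(basis):
    """Coherence of the span of `basis`, a list of r orthonormal vectors in R^n."""
    r, n = len(basis), len(basis[0])
    # Squared norm of the projection of each standard basis vector onto the subspace
    leverage = [sum(u[i] ** 2 for u in basis) for i in range(n)]
    return (n / r) * max(leverage)

n = 100
spread = [[1 / n ** 0.5] * n]        # energy spread evenly over all coordinates
spiky = [[1.0] + [0.0] * (n - 1)]    # energy concentrated on a single coordinate
print(coherence(spread), coherence(spiky))
```

The `spiky` subspace is the kind of outlier-dominated structure mentioned above: a uniform-at-random sample is unlikely to hit its single informative coordinate, which is why uniform sampling guarantees degrade as the coherence grows.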

We  make  the  following  contributions: 

1. For the matrix completion problem, we give a simple algorithm that exactly recovers an
$n \times n$ rank $r$ matrix using at most $O(n r \mu_0 \log^2(r))$ measurements, where $\mu_0$ is the coherence
parameter on the column space of the matrix (Corollary 2.2). This algorithm outperforms
all existing results on matrix completion both in terms of sample complexity and in the fact
that we place no assumptions on the row space of the matrix. The algorithm is extremely
simple, runs in $O(n r^2)$ time, and can be implemented in one pass over the matrix.

2. We derive a lower bound showing that in the absence of row-space incoherence, any
non-interactive scheme must see $\Omega(n^2)$ entries (Theorem 2.3). This concretely demonstrates
the power of interactivity in the matrix completion problem.

3. For the tensor completion problem, we show that a recursive application of our matrix
completion algorithm can recover an order-$T$, $n \times \cdots \times n$ tensor using $O(nT^2 r^{T-1} \mu_0^{T-1} \log^2 r)$
interactively-obtained samples (Theorem 2.1). This algorithm significantly outperforms
all existing results on tensor completion and, as above, is quite simple.

4. We complement this with a necessary condition for tensor completion under random sampling,
showing that our interactive strategy is competitive with any approach based on
uniform sampling (Theorem 2.4). This is the first sample complexity lower bound for tensor
completion, although it is weaker than the lower bound for the matrix completion case
in Theorem 2.3.

5. For matrix approximation, we analyze an algorithm that, after an interactive sampling
phase, approximates the input matrix by the top $r$ ranks of an appropriately rescaled
zero-filled version of the matrix. We show that with just $O(nr\mu \log^2(n))$ samples, this
approximation is competitive with the best rank $r$ approximation of the matrix (Theorem 2.5).
Here $\mu$ is a coherence parameter on each column of the matrix; as before, we make no
assumptions about the row space of the input. Again, this result significantly outperforms
existing results on matrix approximation from non-interactively collected samples.

This chapter is organized as follows: we conclude this introduction with some basic definitions
and then turn to related work in Section 2.1. The main results for the exact completion problems
are given in Section 2.2, while our matrix approximation algorithm and analysis are in Section 2.3.
Proofs are provided in Section 2.4, and we provide some simulations that validate our theoretical
results in Section 2.5. We conclude the chapter in Section 2.6.


2.0.1  Preliminaries 

In this chapter, we are interested in recovering, or approximating, a $d \times n$ matrix $X$ given a budget
of $M$ observations, where we assume $d < n$. We denote the columns of $X$ by $x_1, \ldots, x_n \in \mathbb{R}^d$
and use $t$ to index the columns. We use $x_t(i)$ to denote the $i$th coordinate of the column $x_t$.

We use capital letters to denote subspaces and we overload notation by using the same symbol
to refer to a subspace and any orthonormal basis for that subspace. Specifically, if $U \subset \mathbb{R}^d$ is
a subspace with dimension $r$, we may use $U$ to refer to a $d \times r$ matrix whose columns are an
orthonormal basis for that subspace. We use $U^\perp$ to denote the orthogonal complement to the
subspace $U$ and $\mathcal{P}_U$ to refer to the orthogonal projection operator onto $U$.

As we are dealing with missing data and sampling, we also need some notation for subsampling
operations. Let $[d]$ denote the set $\{1, \ldots, d\}$ and let $\Omega$ be a list of $m$ values from $[d]$, possibly
with duplicates (one can think of $\Omega$ as a vector in $[d]^m$, where $\Omega(j)$ is the $j$th coordinate of this
vector). Given such a list $\Omega$, we use two different subsampling operations: $x_\Omega \in \mathbb{R}^m$ is the vector
formed by putting $x(i)$ in the $j$th coordinate if $\Omega(j) = i$, and $\mathcal{R}_\Omega x$ is a zero-filled rescaled version
of $x$ with $\mathcal{R}_\Omega x(i) = 0$ if $i \notin \Omega$ and $\mathcal{R}_\Omega x(i) = d\,x(i)/|\Omega|$ if $i \in \Omega$.

For an $r$-dimensional subspace $U \subset \mathbb{R}^d$, $U_\Omega \in \mathbb{R}^{m \times r}$ is a matrix formed by applying a similar
subsampling operation to the rows of any orthonormal basis for the subspace $U$, i.e., the $j$th row
of $U_\Omega$ is the $i$th row of $U$ if $\Omega(j) = i$. Note that $U_\Omega$, and even the span of the columns of $U_\Omega$,
may not be uniquely defined, as they both depend on the choice of basis for $U$. Nevertheless, we
will use $\mathcal{P}_{U_\Omega}$ to denote the projection onto the span of any single set of columns constructed by
this subsampling operation.
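
The subsampling operations can be sketched in a few lines of NumPy. This is an illustrative sketch rather than code from this work: the function names are hypothetical, indices are 0-based, and $|\Omega|$ is taken to be the length $m$ of the list, counting duplicates (an assumption).

```python
import numpy as np

def subsample(x, omega):
    """x_Omega: the j-th coordinate is x[omega[j]] (duplicates allowed)."""
    return np.asarray(x)[np.asarray(omega)]

def zero_fill_rescale(x, omega):
    """R_Omega x: zero outside Omega, d * x(i) / |Omega| on Omega.

    Assumption: |Omega| is the list length m, counting duplicates.
    """
    x = np.asarray(x, dtype=float)
    omega = np.asarray(omega)
    out = np.zeros_like(x)
    out[omega] = x.shape[0] * x[omega] / omega.shape[0]
    return out

def subsample_rows(U, omega):
    """U_Omega: the j-th row is row omega[j] of the basis matrix U."""
    return np.asarray(U)[np.asarray(omega), :]

x = np.array([1.0, 2.0, 3.0, 4.0])
omega = [2, 0, 2]                      # 0-based indices, with a duplicate
x_omega = subsample(x, omega)          # length-m vector
r_x = zero_fill_rescale(x, omega)      # length-d vector, zero off Omega
```

Note that `x_omega` lives in $\mathbb{R}^m$ while `r_x` keeps the original dimension $d$, which is why the former is used for the projection test and the latter for zero-filled approximation.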

These definitions extend to the tensor setting with slight modifications. Let $X \in \mathbb{R}^{n_1 \times \cdots \times n_T}$
denote an order $T$ tensor with canonical decomposition:

$$X = \sum_{k=1}^{r} a_k^{(1)} \otimes a_k^{(2)} \otimes \cdots \otimes a_k^{(T)} \qquad (2.1)$$

where $\otimes$ is the outer product. Define rank($X$) to be the smallest value of $r$ that establishes this
equality. Note that the vectors $\{a_k^{(t)}\}_{k=1}^{r}$ need not be orthogonal, nor even linearly independent.

We then use the vec operation to unfold a tensor into a vector and define the inner product
$\langle x, y \rangle = \mathrm{vec}(x)^T \mathrm{vec}(y)$. For a subspace $U \subset \mathbb{R}^{\otimes n_i}$, we write it as a $(\prod n_i) \times d$ matrix whose
columns are $\mathrm{vec}(u_i)$, $u_i \in U$. We then define projections and subsampling as in the vector case.
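
As a small illustration of the canonical decomposition and the vec-based inner product, the sketch below builds a low rank tensor from factor matrices. The helper names, and the convention that `factors[t]` stores the vectors $a_k^{(t)}$ as its columns, are assumptions made for this example.

```python
import numpy as np

def cp_tensor(factors):
    """Sum of r outer products; factors[t] has shape (n_t, r)."""
    r = factors[0].shape[1]
    X = np.zeros(tuple(f.shape[0] for f in factors))
    for k in range(r):
        term = factors[0][:, k]
        for f in factors[1:]:
            # Outer product with the k-th vector of the next mode.
            term = np.multiply.outer(term, f[:, k])
        X += term
    return X

def tensor_inner(x, y):
    """<x, y> = vec(x)^T vec(y)."""
    return float(np.ravel(x) @ np.ravel(y))

rng = np.random.default_rng(0)
factors = [rng.standard_normal((n, 2)) for n in (3, 4, 5)]
X = cp_tensor(factors)                 # order-3 tensor built from 2 terms
```

Unfolding `X` along its first mode with `X.reshape(3, -1)` gives a matrix whose rank is at most the number of terms in the decomposition.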

We will frequently work with the truncated singular value decomposition (SVD) of $X$, which is
given by zeroing out its smaller singular values. Specifically, write $X = U_r \Sigma_r V_r^T + U_{-r} \Sigma_{-r} V_{-r}^T$,
where $[U_r, U_{-r}]$ (respectively $[V_r, V_{-r}]$) forms an orthonormal matrix and $\Sigma_r = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$,
$\Sigma_{-r} = \mathrm{diag}(\sigma_{r+1}, \ldots, \sigma_d)$ are diagonal matrices with $\sigma_1 \ge \ldots \ge \sigma_r \ge \sigma_{r+1} \ge \ldots \ge \sigma_d$. The
truncated singular value decomposition is $X_r = U_r \Sigma_r V_r^T$, which is the best rank-$r$ approximation
to $X$ both in Frobenius and spectral norm [8! ].
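
The truncated SVD and its optimality property can be checked numerically. The following is a minimal sketch using NumPy's dense SVD; the function name `truncated_svd` is hypothetical and the random test matrix is only for illustration.

```python
import numpy as np

def truncated_svd(X, r):
    """Return (U_r, sigma_1..sigma_r, V_r^T) for the top r singular triples."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r], s[:r], Vt[:r, :]

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 10))
r = 2
U_r, s_r, Vt_r = truncated_svd(X, r)
X_r = U_r @ np.diag(s_r) @ Vt_r

s = np.linalg.svd(X, compute_uv=False)          # all singular values, descending
frob_err = np.linalg.norm(X - X_r, "fro")       # equals sqrt(sum of discarded sigma^2)
spec_err = np.linalg.norm(X - X_r, 2)           # equals sigma_{r+1}
```

The two error values are exactly the quantities that the best rank-$r$ approximation attains in Frobenius and spectral norm respectively.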

In the matrix completion problem, where we aim for exact recovery, we require that $X$ has rank
at most $r$, meaning that $\sigma_{r+1} = \ldots = \sigma_d = 0$. Thus $X = X_r$, and our goal is to recover $X_r$
exactly from a subset of entries. Specifically, we focus on the 0/1 loss; given an estimator $\hat{X}$ for
$X$, we would like to bound the probability of error:

$$R_{01}(\hat{X}) = \mathbb{P}(\hat{X} \ne X). \qquad (2.2)$$


In the approximation problem, we relax the low rank assumption but are only interested in
approximating the action of $X_r$. The goal is to find a rank $r$ matrix $\hat{X}$ that minimizes:

$$R(\hat{X}) = \|X - \hat{X}\|_F.$$

The matrix $X_r$ is the global minimizer (subject to the rank-$r$ constraint), and our task is to
approximate this low rank matrix effectively. Specifically, we will be interested in finding matrices
$\hat{X}$ that satisfy excess risk bounds of the form:

$$R(\hat{X}) = \|X - \hat{X}\|_F \le \|X - X_r\|_F + \epsilon \|X\|_F \qquad (2.3)$$

Rescaling the excess risk term by $\|X\|_F$ is a form of normalization that has been used before in
the matrix approximation literature [ ,76,9 , ]. While bounds of the form $(1+\epsilon)\|X - X_r\|_F$
may seem more appropriate when the bottom ranks are viewed as a noise term, achieving such a
bound seems to require highly accurate approximations of the SVD of the input matrix [ ],
which is not possible given the extremely limited number of observations in our setting.
Equation 2.3 can be interpreted by dividing by $\|X\|_F$, which shows that $\hat{X}$ captures almost as large a
fraction of the energy of $X$ as $X_r$ does.

Apart from the observation budget $M$ and the approximation rank $r$, the other main quantity
governing the difficulty of these problems is the subspace coherence parameter. For an $r$-dimensional
subspace $U$ of $\mathbb{R}^d$, define

$$\mu(U) = \frac{d}{r} \max_{i \in [d]} \|\mathcal{P}_U e_i\|_2^2,$$

which is a standard measure of subspace coherence [ ]. The quantity $\mu_0 = \mu(U_r)$, which
is bounded between 1 and $d/r$, measures the correlation between the column space $U_r$ and any
standard basis element. When this correlation is small, the energy of the matrix is spread out
fairly uniformly across the rows of the matrix, although it can be non-uniformly distributed
across the columns. We use the column-space coherence $\mu_0$ instead of the row-space analog, and
we will see that the parameter $\mu_0$ controls the sample complexity of our procedure.
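
To make the two extremes of the coherence range concrete, the sketch below evaluates the coherence of a subspace from an orthonormal basis, assuming the standard definition $\mu(U) = \frac{d}{r} \max_i \|\mathcal{P}_U e_i\|_2^2$. The function name and the two example subspaces are illustrative choices, not taken from this work.

```python
import numpy as np

def coherence(U):
    """mu(U) = (d / r) * max_i ||P_U e_i||_2^2 for an orthonormal d x r basis U."""
    d, r = U.shape
    # For orthonormal U, ||P_U e_i||^2 is the squared norm of the i-th row of U.
    return d / r * np.max(np.sum(U ** 2, axis=1))

d, r = 8, 2
# Subspace spanned by two standard basis vectors: maximally coherent.
spiky = np.eye(d)[:, :r]
# Two orthonormal vectors with entries of equal magnitude: minimally coherent.
flat = np.column_stack([np.ones(d), np.tile([1.0, -1.0], d // 2)]) / np.sqrt(d)

mu_spiky = coherence(spiky)
mu_flat = coherence(flat)
```

The spiky subspace attains the upper end $d/r$ of the range, while the flat one attains the lower end 1.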


Such an incoherence assumption does not translate appropriately to the approximate recovery
problem, since the matrix is no longer low rank, but some measure of uniformity is still necessary.
We parameterize the problem by a quantity related to the usual definition of incoherence:

$$\mu = \max_{t \in [n]} \frac{d \|x_t\|_\infty^2}{\|x_t\|_2^2},$$

which is the maximal column coherence. Here, we make no stochastic assumptions, but notice
that this is a restriction on the higher ranks of the matrix. We also make no assumptions about
the row space of the matrix¹.
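
The maximal column coherence can be computed directly from a fully known matrix. The sketch below assumes the quantity is $\max_t d\|x_t\|_\infty^2 / \|x_t\|_2^2$; the function name is hypothetical, and skipping all-zero columns is an assumption made to avoid division by zero.

```python
import numpy as np

def column_coherence(X):
    """max_t d * ||x_t||_inf^2 / ||x_t||_2^2 over the non-zero columns of X."""
    X = np.asarray(X, dtype=float)
    d = X.shape[0]
    sq_norms = np.sum(X ** 2, axis=0)
    keep = sq_norms > 0                       # assumption: ignore all-zero columns
    peaks = np.max(np.abs(X[:, keep]), axis=0) ** 2
    return float(np.max(d * peaks / sq_norms[keep]))

d = 6
spread = np.ones((d, 3))                      # every column has flat energy
spiked = spread.copy()
spiked[:, 1] = 0.0
spiked[0, 1] = 5.0                            # one column concentrated on one row

mu_spread = column_coherence(spread)
mu_spiked = column_coherence(spiked)
```

A single column whose energy sits on one coordinate is enough to push the parameter from its minimum 1 to its maximum $d$.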


2.1  Related  Work 


The  literature  on  low  rank  matrix  completion  and  approximation  is  extremely  vast  and  we  do  not 
attempt  to  cover  all  of  the  existing  ideas.  Instead,  we  focus  on  the  most  relevant  lines  of  work  to 
our  specific  problems.  We  briefly  mention  some  related  work  on  adaptive  sensing. 


2.1.1  Related  work  on  Matrix  and  Tensor  Completion 

Due to its widespread applicability, the matrix completion problem has received considerable
attention in recent years. A series of papers [42, 43, 50, 97, ] establish that $\Omega(nr\mu_0' \log^2(n))$
randomly drawn samples are sufficient for the nuclear norm minimization program to exactly
identify an $n \times n$ matrix with rank $r$. Here $\mu_0' = \max\{\mu(U_r), \mu(V_r)\}$ is the coherence parameter,
which measures the uniformity of both the row and column spaces of the matrix. Candes and
Tao [ ] show that nuclear norm minimization is essentially optimal with an $\Omega(nr\mu_0' \log(n))$ lower
bound for uniform-at-random sampling. In contrast, the guarantee for our interactive procedure
scales linearly with $\mu_0 = \mu(U_r)$, so our algorithm succeeds even when the row space is highly
coherent. This is a regime where non-interactive algorithms provably fail, as we will show.

¹ As before, this could equivalently be the column space with an assumption on the maximal row coherence.

There  is  also  a  line  of  work  analyzing  alternating  minimization-style  procedures  for  the  matrix 
completion  problem  [  06,  11  ,116].  While  the  alternating  minimization  algorithm  is  a  more 
elegant  computational  approach,  the  best  sample  complexity  bounds  to-date  are  either  worse  by 
at  least  a  cubic  factor  in  the  rank  r  or  have  undesirable  dependence  on  the  matrix  condition 
number [ 16]. In practice, however, alternating minimization performs as well as nuclear norm
minimization,  so  this  sub-optimality  appears  to  be  an  artifact  of  the  analysis. 

In a similar spirit to our work, Chen et al. [52] developed an interactive algorithm which
succeeds in the absence of row-space incoherence using $\Omega(nr\mu_0 \log^2(n))$ samples. In comparison,
we operate under the same assumption but achieve an improved sample complexity of
$\Omega(nr\mu_0 \log^2(r))$. A recent paper of Jin and Zhu [ ] further improves slightly on this bound,
achieving $\Omega(nr \log(r))$ sample complexity, but they assume that both the row and column space
are incoherent. Interestingly, their algorithm uses non-interactive but non-uniform sampling.

Tensor completion, a natural generalization of matrix completion, is less studied than the matrix
case. One challenge stems from the NP-hardness of computing most tensor decompositions,
pushing researchers to study alternative structure-inducing norms in lieu of the nuclear
norm [93, 132, 162, 163, 164, 173]. Of these, only Mu et al. [132] and Yuan and Zhang [173]
provide sample complexity bounds for the noiseless setting. Mu et al. [132] show that $\Omega(rn^{T/2})$
random linear measurements suffice to recover a rank $r$ order-$T$ tensor. Yuan and Zhang [173]
instead show that $\Omega(r^{1/2}n^{3/2})$ entries suffice to recover a rank $r$ third-order tensor with incoherent
subspaces, provided the rank is small. In contrast, the sample complexity of our algorithm is
linear in dimension $n$, improving significantly on these non-interactive results.


2.1.2  Related  work  on  Matrix  Approximation 

A number of authors have studied matrix completion with noise and under weaker assumptions.
The most prominent difference between our work and all of these is a relaxation of the main
incoherence assumptions. Both Candes and Plan [ ] and Keshavan et al. [ ] require that
both the row and column spaces of the matrix of interest are highly incoherent. Negahban and
Wainwright [ ] instead use a notion of spikiness, but that too places assumptions on the row
space of interest. Koltchinskii et al. [119] consider matrices with bounded entries, which is
related to the spikiness assumption. In comparison, our results make essentially no assumptions
about the row space, leading to substantially more generality. This is the thesis of this work:
one can eliminate row space assumptions (uniformity assumptions) in matrix recovery problems
through interactive sampling.

Another close line of work is on matrix sparsification [ ,2, 12]. Here, the goal is to zero out
many entries of a matrix while preserving global properties such as the principal subspace. The
main difference from matrix completion is that the entire matrix is observed, which allows one
to relax incoherence assumptions. The only result from this line that does not require knowledge
of the matrix is a random sampling scheme of Achlioptas and McSherry [ ], but it is only
competitive with matrix completion when the input has entries of fairly constant magnitude [ 19].
Interestingly, this requirement is essentially the same as the spikiness assumption [ ] and the
bounded magnitude assumption [119] in the matrix completion literature.


Several  techniques  have  been  proposed  for  matrix  approximation  in  the  fully  observed  setting, 
optimizing  computational  complexity  or  other  objectives.  A  particularly  relevant  series  of  papers 
is  on  the  column  subset  selection  (CSS)  problem,  where  the  span  of  several  judiciously  chosen 
columns  is  used  to  approximate  the  principal  subspace.  One  of  the  best  approaches  involves 
sampling  columns  according  to  the  statistical  leverage  scores,  which  are  the  norms  of  the  rows 
of  the  n  x  r  matrix  formed  by  the  top  r  right  singular  vectors  [36,  7,  7'  ].  Unfortunately,  this 
strategy  does  not  seem  to  apply  in  the  missing  data  setting,  as  the  distribution  used  to  sample 
columns  -  which  are  subsequently  used  to  approximate  the  matrix  -  depends  on  the  unobserved 
input  matrix.  Approximating  this  distribution  seems  to  require  a  very  accurate  estimate  of  the 
matrix  itself,  and  this  initial  estimate  would  suffice  for  the  matrix  approximation  problem.  This 
difficulty also arises with volume sampling [ ], another popular approach to CSS; the sampling
distribution depends on the input matrix and we are not aware of strategies for approximating this
distribution in the missing data setting.


In terms of interactive sampling, a number of methods for recovery of sparse, structured signals
have been shown to outperform non-interactive methods [ , 7,124, , 161]. While having
their share of differences, these methods can all be viewed as either binary search or local search
methods that iteratively discard irrelevant coordinates and focus measurements on the remainder.
In  particular,  these  methods  rely  heavily  on  the  sparsity  and  structure  of  the  input  signal,  and 
extensions  to  other  settings  have  been  elusive.  While  a  low  rank  matrix  is  sparse  in  its  eigenbasis, 
the  search-style  techniques  from  the  signal  processing  community  do  not  seem  to  leverage  this 
structure  effectively  and  these  approaches  do  not  appear  to  be  applicable  to  our  setting. 


Some of these interactive sampling efforts focus specifically on recovering or approximating
highly structured matrices, which is closely related to our setting. Tanczos and Castro [ 6 ] and
Balakrishnan et al. [ ] consider variants of biclustering, which is equivalent to recovering a
rank-one binary matrix from noisy observations. Singh et al. [ 158] recover noisy ultrametric
matrices, while in Chapter 3 we use a similar idea to find a hierarchical clustering from interactively
sampled similarities. All of these results can be viewed as matrix completion or approximation,
but impose significantly more structure on the target matrix than we do here. For this reason,
many of these algorithmic ideas also do not appear to be useful in our setting.

Algorithm 1 Interactive Matrix Completion ($X \in \mathbb{R}^{d \times n}$, $m$)

1. Let $U = \emptyset$.

2. Randomly draw entries $\Omega \subset [d]$ of size $m$ uniformly with replacement.

3. For each column $x_t$ of $X$ ($t \in [n]$):

   (a) If $\|x_{t\Omega} - \mathcal{P}_{U_\Omega} x_{t\Omega}\|_2^2 > 0$:

       i. Fully observe $x_t$ and add to $U$ (orthogonalize $U$).

       ii. Randomly draw a new set $\Omega$ of size $m$ uniformly with replacement.

   (b) Otherwise $\hat{x}_t \leftarrow U (U_\Omega^T U_\Omega)^{-1} U_\Omega^T x_{t\Omega}$.

4. Return $\hat{X}$ with columns $\hat{x}_t$.


2.2  Matrix  and  Tensor  Completion 


In  this  section  we  develop  the  main  theoretical  guarantees  on  the  exact  low-rank  completion 
problems.  We  first  develop  our  interactive  algorithm  for  matrices  and  tensors  and  state  their 
main  performance  guarantee.  We  then  turn  to  several  necessary  conditions  for  these  problems. 

Our procedure for the matrix case, whose pseudocode is displayed in Algorithm 1, streams the
columns of the matrix $X$ into memory and iteratively adds directions to an estimate for the
column space of $X$. The algorithm maintains a subspace $U$ and, when processing the $t$th column
$x_t$, estimates the norm of $\mathcal{P}_{U^\perp} x_t$ using only a few entries of $x_t$. We will ensure that, with high
probability, this estimate will be non-zero if and only if $x_t$ contains a new direction. If the
estimate is non-zero, the algorithm asks for the remaining entries of $x_t$ and adds the new direction
to the subspace $U$. Otherwise, $x_t$ lies in $U$ and we will see that the algorithm already has sufficient
information to complete the column $x_t$.

Therefore, the key ingredient of the algorithm is the estimator for the projection onto the
orthogonal complement of the subspace $U$. This quantity is estimated as follows. Using a list of
$m$ locations $\Omega$ sampled uniformly with replacement from $[d]$, we down-sample both $x_t$ and an
orthonormal basis $U$ to $x_{t\Omega}$ and $U_\Omega$. We then use $\|x_{t\Omega} - \mathcal{P}_{U_\Omega} x_{t\Omega}\|_2^2$ as our estimate. It is easy to
see that this estimator leads to a test with one-sided error, since the estimator is exactly zero if
$x_t \in U$. In our analysis, we establish a relative-error deviation bound, which allows us to apply
this test in our algorithm.
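
The one-sided nature of this test is easy to see numerically. The sketch below computes the subsampled residual $\|x_\Omega - \mathcal{P}_{U_\Omega} x_\Omega\|_2^2$ with a least-squares solve; the function name, the problem sizes, and the use of `numpy.linalg.lstsq` to form the projection are assumptions of this example rather than details taken from the analysis.

```python
import numpy as np

def residual_estimate(x_omega, U_omega):
    """Squared norm of x_Omega minus its projection onto the columns of U_Omega."""
    if U_omega.shape[1] == 0:
        return float(x_omega @ x_omega)
    # Least squares gives the coefficients of the projection onto span(U_Omega).
    coef, *_ = np.linalg.lstsq(U_omega, x_omega, rcond=None)
    resid = x_omega - U_omega @ coef
    return float(resid @ resid)

rng = np.random.default_rng(0)
d, r, m = 50, 3, 10
U, _ = np.linalg.qr(rng.standard_normal((d, r)))   # orthonormal basis, random subspace
omega = rng.integers(0, d, size=m)                 # m locations, with replacement

x_in = U @ rng.standard_normal(r)                  # lies in the subspace
x_out = x_in + rng.standard_normal(d)              # has a component outside it

est_in = residual_estimate(x_in[omega], U[omega, :])
est_out = residual_estimate(x_out[omega], U[omega, :])
```

For a vector in the subspace the estimate is zero up to floating-point round-off regardless of which locations are sampled, so errors can only occur by missing a genuinely new direction.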

A subtle but critical aspect of the algorithm is the choice of $\Omega$. The list $\Omega$ always has $m$ elements,
and each element is sampled uniformly with replacement from $[d]$. More importantly, we only
resample $\Omega$ when we add a direction to $U$. This ensures that the algorithm does not employ too
much randomness, which would lead to an undesirable logarithmic dependence on $n$.
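
Putting the pieces together, the matrix procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name is hypothetical, the exact test "estimate > 0" is replaced by a small relative tolerance `tol` to cope with floating-point round-off, and the fully known array `X` stands in for an oracle that reveals only the entries the code reads.

```python
import numpy as np

def interactive_matrix_completion(X, m, rng, tol=1e-9):
    """Sketch of the interactive completion loop; returns (X_hat, entries observed)."""
    d, n = X.shape
    U = np.zeros((d, 0))                        # orthonormal basis found so far
    omega = rng.integers(0, d, size=m)          # m rows, uniform with replacement
    X_hat = np.zeros((d, n))
    observed = 0
    for t in range(n):
        x_om = X[omega, t]                      # observe m entries of column t
        observed += m
        U_om = U[omega, :]
        if U.shape[1] > 0:
            coef, *_ = np.linalg.lstsq(U_om, x_om, rcond=None)
            resid = x_om - U_om @ coef
        else:
            coef, resid = np.zeros(0), x_om
        # tol is an assumption replacing the exact "> 0" test of the pseudocode.
        if resid @ resid > tol * max(1.0, x_om @ x_om):
            x = X[:, t]                         # new direction: observe whole column
            observed += d
            v = x - U @ (U.T @ x)               # Gram-Schmidt step
            U = np.column_stack([U, v / np.linalg.norm(v)])
            omega = rng.integers(0, d, size=m)  # resample only when U grows
            X_hat[:, t] = x
        else:
            # Same as U (U_om^T U_om)^{-1} U_om^T x_om when U_om has full column rank.
            X_hat[:, t] = U @ coef
    return X_hat, observed

rng = np.random.default_rng(1)
d, n, r, m = 40, 60, 3, 20
X = rng.standard_normal((d, r)) @ rng.standard_normal((r, n))
X_hat, observed = interactive_matrix_completion(X, m, rng)
```

In this example the count of observed entries is $nm$ for the tests plus $d$ for each of the $r$ fully observed columns, which is far fewer than the $dn$ entries of the matrix when $m$ and $r$ are small.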

For tensors, the algorithm becomes recursive in nature. At the outer level of the recursion, the
algorithm maintains a candidate subspace $U$ for the mode $T$ subtensors $X_i^{(T)}$. For each of these
subtensors, we test whether $X_i^{(T)}$ lives in $U$ and recursively complete that subtensor if it does
not. Once we complete the subtensor, we add it to $U$ and proceed at the outer level. When the
subtensor itself is just a column, we observe the column in its entirety.

Algorithm 2 Interactive Tensor Completion ($X$, $\{m_t\}_{t=1}^{T-1}$)

1. If $X$ is just a vector, sample $X$ entirely and return it.

2. Let $U = \emptyset$.

3. Randomly draw entries $\Omega \subset \prod_{t=1}^{T-1} [n_t]$ uniformly with replacement w.p. $m_{T-1} / \prod_{t=1}^{T-1} n_t$.

4. For each mode-$T$ subtensor $X_i^{(T)}$ of $X$, $i \in [n_T]$:

   (a) If $\|X_{i\Omega}^{(T)} - \mathcal{P}_{U_\Omega} X_{i\Omega}^{(T)}\|_2^2 > 0$:

       i. $\hat{X}_i^{(T)} \leftarrow$ recurse on $X_i^{(T)}$.

       ii. $U_i \leftarrow \mathcal{P}_{U^\perp} \hat{X}_i^{(T)}$ and add $U_i$ to $U$.

   (b) Otherwise $\hat{X}_i^{(T)} \leftarrow U (U_\Omega^T U_\Omega)^{-1} U_\Omega^T X_{i\Omega}^{(T)}$.

5. Return $\hat{X}$ with mode-$T$ subtensors $\hat{X}_i^{(T)}$.


Turning  to  the  performance  guarantees  for  these  algorithms,  we  first  bound  the  probability  of 
error  for  the  tensor  completion  algorithm  (Algorithm  2).  The  guarantee  for  Algorithm  1  is  just  a 
specialization  of  this  result  to  the  order-two  case.  The  following  result  is  based  on  an  analysis  of 
the  test  statistic  and  the  reconstruction  procedure  in  Algorithm  2.  See  Section  2.4  for  the  proof. 
Theorem 2.1. Let $X = \sum_{i=1}^{r} \otimes_{t=1}^{T} a_i^{(t)}$ be a rank $r$ order-$T$ tensor with subspaces $A^{(t)} =
\mathrm{span}(\{a_j^{(t)}\}_{j=1}^{r})$. Suppose that all of $A^{(1)}, \ldots, A^{(T-1)}$ have coherence bounded above by $\mu_0$. For
any $\delta \in (0, 1)$, Algorithm 2 has $R_{01}(\hat{X}) \le \delta$ provided that we set:

$$m_t \ge 32 T r^t \mu_0^t \log^2(10rT/\delta). \qquad (2.4)$$

With this choice, the total number of samples used is:

$$32 \left(\sum_{t=1}^{T} n_t\right) r^{T-1} \mu_0^{T-1} T \log^2(10rT/\delta). \qquad (2.5)$$

The running time of the algorithm is:

$$O\left(r^2 \prod_{t=1}^{T-1} n_t + r^T \sum_{t=1}^{T} n_t + T r^{2+T}\right) \qquad (2.6)$$

when we treat $\mu_0$ as a constant and ignore logarithmic factors.


In the special case of an $n \times \cdots \times n$ tensor of order $T$, the algorithm succeeds with probability
at least $1 - \delta$ using $\Omega(nr^{T-1}\mu_0^{T-1}T^2 \log^2(Tr/\delta))$ samples, exhibiting a linear dependence on
the tensor dimensions. In comparison, all guarantees for tensor completion we are aware of
have super-linear dependence on the tensor dimension $n$ [ , ]. To our knowledge, the best
known sample complexity is $O(r^{1/2}n^{3/2})$ for exact recovery of an $n \times n \times n$ third-order tensor
of rank $r$ [ ]. An alternating minimization procedure is known to achieve $O(r^5 n^{3/2})$ sample
complexity for this problem [111].

In the noiseless scenario, one can unfold the tensor into an $n_1 \times \prod_{t=2}^{T} n_t$ matrix and apply any
matrix completion algorithm. Unfortunately, without exploiting the additional tensor structure,
this approach will scale with $\prod_{t=2}^{T} n_t$, which is similarly much worse than our guarantee. Note
that the naive procedure that does not perform the recursive step has sample complexity scaling
with the product of the dimensions and is therefore much worse than our algorithm.

The most obvious specialization of Theorem 2.1 is to the matrix completion problem.
Pseudocode for this algorithm is provided in Algorithm 1.

Corollary 2.2. Let $X \in \mathbb{R}^{d \times n}$ have rank $r$ and column space $U$ with coherence $\mu(U) \le \mu_0$.
Then for any $\delta \in (0, 1)$, the output of Algorithm 2 has risk $R_{01}(\hat{X}) \le \delta$ provided that:

$$m \ge 32 r \mu_0 \log^2(10r^2/\delta). \qquad (2.7)$$

The sample complexity is $dr + nm$ and the running time is $O(nmr + r^3 m + dr^2)$.
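
As a rough worked example of how the corollary's budget behaves, the sketch below evaluates the per-column sample size and the total sample complexity $dr + nm$. It assumes the condition reads $m \ge 32 r \mu_0 \log^2(10r^2/\delta)$ with natural logarithms; the function names and the chosen dimensions are hypothetical.

```python
import math

def per_column_budget(r, mu0, delta):
    """Smallest integer m with m >= 32 * r * mu0 * log^2(10 r^2 / delta).

    Assumption: natural logarithm and the constants as read from the corollary.
    """
    return math.ceil(32 * r * mu0 * math.log(10 * r**2 / delta) ** 2)

def total_samples(d, n, r, m):
    """Sample complexity d*r + n*m stated in the corollary."""
    return d * r + n * m

r, mu0, delta = 2, 1.0, 0.1
m = per_column_budget(r, mu0, delta)          # depends on r, mu0, delta only
small = total_samples(10**4, 10**6, r, m)     # d = 10^4 rows
large = total_samples(10**5, 10**6, r, m)     # d = 10^5 rows
```

Since `m` does not depend on $d$ or $n$, increasing the number of rows tenfold changes the total only through the $dr$ term, while the full matrix grows tenfold in size.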

To the best of our knowledge, this result provides the strongest guarantee for the matrix completion
problem. The vast majority of results require both incoherent row and column spaces and
are therefore considerably more restrictive than ours [42, 43, 50, 97, 1 ]. For example, Recht
shows that by solving the nuclear norm minimization program, one can recover $X$ exactly,
provided that the number of measurements exceeds $32(d + n)r \max\{\mu_0', \mu_1^2\} \log^2(n)$, where recall
that $\mu_0'$ upper bounds the coherence of both the row and column space, and $\mu_1$ provides another
incoherence-type assumption (which can be removed [5 ]). Our result improves on his not only
in relaxing the row space incoherence assumption, but also in terms of sample complexity, as we
remove the logarithmic dependence on problem dimension.

As another example, Gittens [95] showed that the Nystrom method can recover a rank $r$ matrix
from randomly sampling $O(r \log r)$ columns. While his result matches ours in terms of sample
complexity,  he  analyzes  positive-semidefinite  matrices  with  incoherent  principal  subspace,  which 
translates  to  assuming  that  both  row  and  column  spaces  are  incoherent.  Again,  in  relaxing  this 
assumption,  our  result  is  substantially  more  general. 

We  mention  that  the  two-phase  algorithm  of  Chen  et  al.  [5i  ]  based  on  local  coherence  sampling 
allows for coherent row spaces. Their algorithm requires $O((n + d)r\mu_0 \log(n))$ samples, which is
weaker  than  our  guarantee  in  that  it  has  a  slightly  super-linear  dependence  on  problem  dimension. 
An  interesting  consequence  of  Corollary  2.2  is  that  the  amortized  number  of  samples  per  column 
is  completely  independent  of  the  problem  dimension. 

Regarding computational considerations, the algorithm operates in one pass over the columns,
and need only store the matrix in condensed form, which requires $O((n+d)r)$ space. Specifically,
the algorithm maintains a (partial) basis for the column space and the coefficients for representing
each column by that basis, which leads to an optimally condensed representation. Moreover,
the computational complexity of the algorithm is linear in the matrix dimensions $d$, $n$ with mild
polynomial dependence on the rank $r$. For this run-time analysis, we work in a computational
model where accessing any entry of the matrix is a constant-time operation, which allows us to
circumvent the $\Omega(dn)$ time it would otherwise take to read the input. In comparison, the two standard
algorithms for matrix completion, the iterative Singular Value Thresholding Algorithm [ ]
and alternating least-squares [ , ], are significantly slower than Algorithm 2, not only due
to their iterative nature, but also in per-iteration running time.


2.2.1  Necessary  conditions  for  non-interactive  sampling 

In this section we prove a number of lower bounds for matrix and tensor completion for
non-interactive sampling procedures. Note that a parameter counting argument shows that interactive
sampling requires $\Omega(r \sum_{t=1}^{T} n_t)$ samples. Each entry of a rank $r$ tensor can be expressed as a
polynomial of the vectors in the canonical decomposition, so the observations lead to a polynomial
system in $r \sum_{t=1}^{T} n_t$ variables. If $M < r \sum_{t=1}^{T} n_t - T\binom{r}{2}$ (there are $T\binom{r}{2}$ orthonormality
constraints), then this system is underdetermined, and since it has one solution, it must have
infinitely many, so that recovery is impossible. Our algorithm matches this lower bound in its
dependence on the tensor dimensions, but is polynomially worse in terms of the rank $r$. However,
for the matrix case, Corollary 2.2 shows that our matrix completion algorithm is nearly
optimal, disagreeing only in its dependence on the column incoherence parameter and logarithmic
factors. In this section we will show that non-interactive sampling has much more stringent
necessary conditions.

Our first result is a necessary condition against non-interactive sampling for the matrix completion
problem when the row space is highly coherent. We show that if the matrix has coherent
row space, then any non-interactive scheme followed by any recovery procedure requires $\Omega(dn)$
samples to recover a $d \times n$ matrix $X$.

To formalize our lower bound we fix a sampling budget $M$ and consider an estimator to be a
sampling distribution $q$ over $\{(i,j) \mid i \in [d], j \in [n]\}^M$ and a (possibly randomized) function
$f : \{(\Omega, X_\Omega)\} \to \mathbb{R}^{d \times n}$ that maps a set of indices and values to a $d \times n$ matrix. Let $\mathcal{Q}(M)$ denote
the set of all such sampling distributions and let $\mathcal{F}$ denote the set of all such estimators. Lastly
let $\mathcal{X}(d, n, r, \mu_0)$ denote the set of all $d \times n$ rank $r$ matrices with column incoherence at most $\mu_0$.
We consider the minimax probability of error:

$$R^\star(d, n, r, \mu_0, M) = \inf_{f \in \mathcal{F}} \inf_{q \in \mathcal{Q}(M)} \sup_{X \in \mathcal{X}(d, n, r, \mu_0)} \mathbb{P}_{\Omega \sim q}\left[f(\Omega, X_\Omega) \ne X\right]$$

where the probability also accounts for potential randomness in the estimator $f$. Note that
since we make no assumptions about the distribution $q$ other than excluding interactive distributions,
this setup subsumes essentially all non-interactive sampling strategies including
uniform-at-random, deterministic, and distributions sampling entire columns. The one exception is the
Bernoulli sampling model, where each entry $(i, j)$ is observed with probability $q_{ij}$ independently
of all other entries, although we believe a similar lower bound holds there.


The following theorem lower bounds the failure probability of any non-interactive strategy and
consequently gives a necessary condition on the sample complexity.

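Theorem 2.3. The minimax risk $R^\star$ satisfies:

$$R^\star(d, n, r, \mu_0, M) \ge \frac{1}{2} - \left\lceil \frac{M}{\left(1 - \frac{r-1}{r\mu_0}\right) d} \right\rceil \frac{1}{2(n-r)}, \qquad (2.8)$$

which approaches 1/2 whenever:

$$M = o\left((dn - dr)\left(1 + \frac{1}{r\mu_0} - \frac{1}{\mu_0}\right)\right). \qquad (2.9)$$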

As a concrete instantiation of the theorem, if $\mu_0$ is bounded from below by any constant $c > 1$
(which is possible whenever $r < d/c$), then the bound approaches 1/2 whenever $M = o(d(n - r))$.
Thus all non-interactive algorithms must have sample complexity that is quadratic in the
problem dimension. In contrast, Corollary 2.2 ensures that Algorithm 2 has nearly linear sample
complexity, which is a significant improvement over non-interactive algorithms.

The literature contains several other necessary conditions on the sample complexity for matrix
completion. A simple argument shows that without any form of incoherence, one requires $\Omega(dn)$
samples to recover even a rank one matrix that is non-zero in just one entry. This argument
also applies to interactive sampling strategies and shows that some measure of incoherence is
necessary. With both row and column incoherence, but under uniform sampling, Candes and
Tao [ ] prove that $\Omega(\mu_0 n r \log(n))$ observations are necessary to recover an $n \times n$ matrix.

One can relax the incoherence assumption by non-uniform non-interactive sampling, although
the sampling distribution is matrix-specific as it depends on the local coherence structure [52].
Unfortunately, one cannot compute the appropriate sampling distribution before taking any measurements.
Our result shows that in the absence of row-space incoherence, there is no universal
non-interactive sampling scheme that can achieve a non-trivial sample complexity. Thus interactivity
is necessary to relax the incoherence assumption in completion problems.


Turning to necessary conditions for tensor completion, we adapt the proof of Candes and Tao [ ]
to this setting and establish the following lower bound for uniform sampling:

Theorem 2.4. Fix $1 \le m, r \le \min_t n_t$ and $\mu_0 > 1$. Fix $0 < \delta < 1/2$ and suppose that we do not
have the condition:


~  log 


> 


T—  1  T—  1 

ho  r 


(2.10) 


Then there exist infinitely many pairs of distinct $n_1 \times \dots \times n_T$ order-$T$ tensors $X \ne X'$ of rank
$r$ with coherence parameter $\le \mu_0$ such that $\mathcal{P}_\Omega(X) = \mathcal{P}_\Omega(X')$ with probability at least $\delta$. Each
entry is observed independently with probability $p = m / \prod_{t=1}^{T} n_t$.


Theorem 2.4 implies that as long as the right hand side of Equation 2.10 is at most $\epsilon < 1$, and:

m  <  nirT"Vo_1  log  (1  -  e/2)  (2.11) 
Algorithm 3 Low Rank Approximation ($X$, $m_1$, $m_2$)

1. Pass 1: For each column $x_t$, observe a set $\Omega_{1,t}$ of size $m_1$ uniformly at random with replacement
and estimate $\hat{c}_t = \frac{d}{m_1}\|x_{t,\Omega_{1,t}}\|_2^2$. Estimate $\hat{f} = \sum_t \hat{c}_t$.

2. Pass 2: Set $\tilde{X} = 0 \in \mathbb{R}^{d \times n}$.

(a) For each column $x_t$, sample $m_{2,t} = m_2 n \hat{c}_t / \hat{f}$ observations $\Omega_{2,t}$ uniformly at random
with replacement.

(b) Update $\tilde{X} = \tilde{X} + (\mathcal{R}_{\Omega_{2,t}} x_t) e_t^T$.

3. Compute the SVD of $\tilde{X}$ and output $\hat{X}$, which is formed by the top-$r$ ranks of $\tilde{X}$.


then with probability at least $\delta$ there are infinitely many tensors that agree on the observed
entries. The expected number of samples observed is $m$. This gives a necessary condition on
the number of samples required for tensor completion. Comparing with Theorem 2.1 shows
that our procedure outperforms any non-interactive procedure in its dependence on the tensor
dimensions, as our bound does not include a $\log(n)$ factor. Note that our guarantee matches the
polynomial terms in this lower bound in its dependence on $n$, $r$, $\mu_0$, although the dependence on
the tensor order $T$ is better here.


2.3  Matrix  Approximation 


For the matrix approximation problem, we propose an interactive sampling algorithm to obtain
a low-rank approximation to $X$. The algorithm (see Algorithm 3 for pseudocode) makes two
passes through the columns of the matrix. In the first pass, it subsamples each column uniformly
at random and estimates each column norm and the matrix Frobenius norm. In the second pass,
the algorithm samples additional observations from each column, and for each $t$, places the
rescaled zero-filled vector $\mathcal{R}_{\Omega_{2,t}} x_t$ into the $t$th column of a new matrix $\tilde{X}$, which is a preliminary
estimate of the input, $X$. Once the initial estimate $\tilde{X}$ is computed, the algorithm zeros out all but
the top $r$ ranks of $\tilde{X}$ to form $\hat{X}$. We will show that $\hat{X}$ has low excess risk, when compared with
the best rank-$r$ approximation, $X_r$.

A  crucial  feature  of  the  second  pass  is  that  the  number  of  samples  per  column  is  proportional  to 
the  squared  norm  of  that  column.  Of  course  this  sampling  strategy  is  only  possible  if  the  column 
norms  are  known,  motivating  the  first  pass  of  the  algorithm,  where  we  estimate  precisely  this 
sampling  distribution.  This  feature  allows  the  algorithm  to  tolerate  highly  non-uniform  column 
norms,  as  it  focuses  measurements  on  high-energy  columns,  and  leads  to  significantly  better 
approximation.  This  idea  has  been  used  before,  although  only  in  the  exactly  low-rank  case  [52]. 
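
The two sampling passes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names are hypothetical, the budget is rounded to an integer, repeated indices in the rescaled zero-filled column are counted with multiplicity (which keeps the column estimate unbiased), and the final truncated SVD step is omitted.

```python
import random

def estimate_sq_norms(columns, m1, rng):
    """Pass 1: estimate each squared column norm from m1 uniform samples
    drawn with replacement, rescaled by d / m1."""
    d = len(columns[0])
    estimates = []
    for col in columns:
        idx = [rng.randrange(d) for _ in range(m1)]
        estimates.append(d / m1 * sum(col[i] ** 2 for i in idx))
    return estimates

def second_pass_budgets(c_hat, m2):
    """Pass 2 budgets: m_{2,t} = m2 * n * c_t / f with f = sum_t c_t.
    Rounding to an integer is an assumption of this sketch."""
    n = len(c_hat)
    f_hat = sum(c_hat)  # assumed non-zero
    return [round(m2 * n * c / f_hat) for c in c_hat]

def rescaled_zero_fill(col, m, rng):
    """Zero-filled column built from m uniform samples with replacement,
    rescaled by d / m. Repeated indices accumulate (assumption)."""
    d = len(col)
    out = [0.0] * d
    for _ in range(m):
        i = rng.randrange(d)
        out[i] += d / m * col[i]
    return out
```

Columns with larger estimated energy receive proportionally more of the total budget $n m_2$, and a column estimated to be zero receives none.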

For the main performance guarantee, we only assume that the matrix has incoherent columns,
that is $d\|x_t\|_\infty^2/\|x_t\|_2^2 \le \mu$ for each column $x_t$. In particular we make no additional assumptions
about the high-rank structure of the matrix. We have the following theorem:
Theorem 2.5. Set $m_1 \ge 32\mu\log(n/\delta)$ and assume $n \ge d$ and that $X$ has $\mu$-incoherent columns.
With probability $\ge 1 - 2\delta$, Algorithm 3 computes an approximation $\hat{X}$ such that:

-  *11'  +  M'  (^)  +  PtOF) 

using $n(m_1 + m_2)$ samples. In other words, the output $\hat{X}$ satisfies $\|X - \hat{X}\|_F \le \|X - X_r\|_F +
\epsilon\|X\|_F$ with probability $\ge 1 - 2\delta$ and with sample complexity:

32n/ilog(n/5)  4 - —nrp  log2  ( — — — j  .  (2.12) 

The proof is deferred to Section 2.4. The theorem shows that the matrix $\hat{X}$ serves as nearly as
good an approximation to $X$ as $X_r$. Specifically, with $O(nr\mu\log^2(d + n))$ observations, one can
compute a suitable approximation to $X$. The running time of the algorithm is dominated by the
cost of computing the truncated SVD, which is at most $O(d^2 n)$.

While the dependence between the number of samples and the problem parameters $n$, $r$, and
$\mu$ is quite mild and matches existing matrix completion results, the dependence on the error $\epsilon$
in Equation 2.12 seems undesirable. This dependence arises from our translation of a bound on
$\|X - \tilde{X}\|_2$ into a bound on $\|X - \hat{X}\|_F$, which results in the $m_2^{-1/4}$-dependence in the error bound.
We are not aware of better results in the general setting, but a number of tighter translations are
possible under various assumptions. We mention just two such results here.

Proposition 2.6. Under the same assumptions as Theorem 2.5, suppose further that $X$ has rank
at most $r$. Then with probability $\ge 1 - 2\delta$:

$$\|\hat{X} - X\|_F \le 20\|X\|_F \sqrt{\frac{r\mu}{m_2}} \log\left(\frac{d+n}{\delta}\right)$$

This proposition tempers the dependence on the error $\epsilon$ from $1/\epsilon^4$ to $1/\epsilon^2$ in the event that the
input matrix has rank at most $r$. This gives a relative error guarantee for Algorithm 3 on the
matrix completion problem, which improves on the one implied by Theorem 2.5. Note that this
guarantee is weaker than Corollary 2.2, but Algorithm 3 is much more robust to relaxations of
the low rank assumption as demonstrated in Theorem 2.5.

A similarly mild dependence on $\epsilon$ can be derived under the assumption that $X = A + R$, where $A$
has rank $r$ and $R$ is some perturbation, which has the flavor of existing noisy matrix completion
results. Here, it is natural to recover the parameter $A$ rather than the top $r$ ranks of $X$ and we
have the following parameter recovery guarantee for Algorithm 3:

Proposition 2.7. Let $X = A + R$ where $A$ has rank at most $r$. Suppose further that $X$ has
$\mu$-incoherent columns and set $m_1 \ge 32\mu\log(n/\delta)$. Then with probability $\ge 1 - 2\delta$:

$$\|\hat{X} - A\|_F \le 20\sqrt{\frac{r\mu}{m_2}} \log\left(\frac{d+n}{\delta}\right) \left(\|A\|_F + \|R_\Omega\|_F\right) + \sqrt{8r}\,\|R\|_2 \tag{2.13}$$

where the number of samples is $n(m_1 + m_2)$ and $\Omega$ is the set of all entries observed over the
course of the algorithm.
To interpret this bound, let $\|A\|_F = 1$, and let $R$ be a random matrix whose entries are independently
drawn from a Gaussian distribution with variance $\sigma^2/(dn)$. Note that this normalization
for the variance is appropriate in the high-dimensional setting where $n, d \to \infty$, since we keep
the signal-to-noise ratio $\|A\|_F^2/\|R\|_F^2 = 1/\sigma^2$ constant. The last term can be ignored, since by

the  standard  bound  on  the  spectral  norm  of  a  Gaussian  matrix,  ||i?||2  =  log((n  +  d)/5 )) 

which  will  be  lower  order  [  ].  We  can  also  bound  \\Rq  \\f  <  0(a^**¥*\og((n  +  d)/5))  using 
a  Gaussian  tail  bound.  With  m\  <  rn2  we  arrive  at: 


\\X-.A\\F<c* 


d  +  n\ 


where $c_\star$ is some positive constant. In the high dimensional setting, when $r\mu = o(d)$, this shows
that Algorithm 3 consistently recovers $A$ as long as $m_2 = \omega(r\mu)$. This second condition implies
that the total number of samples used is $\omega(nr\mu)$.


2.3.1  Comparison  with  related  results 

The closest result to Theorem 2.5 is the result of Koltchinskii et al. [119] who consider a
soft-thresholding procedure and bound the approximation error in squared-Frobenius norm. They
assume that the matrix has bounded entry-wise $\ell_\infty$ norm and give an entry-wise squared-error
guarantee of the form:

$$\|\hat{X} - X\|_F^2 \le \|X - X_r\|_F^2 + c\, dn\|X\|_\infty^2 \frac{nr\log(d + n)}{M} \tag{2.14}$$

where $M$ is the total number of samples and $c$ is a constant. Their bound is quite similar to
ours in the relationship between the number of samples and the target rank $r$. However, since
$dn\|X\|_\infty^2 \ge \|X\|_F^2$, their bound is significantly worse in the event that the energy of the matrix
is concentrated on a few columns.

To make this concrete, fix $\|X\|_F = 1$ and let us compare the matrix where every entry is $1/\sqrt{dn}$
with the matrix where one column has all entries equal to $1/\sqrt{d}$. In the former, the error term in
the squared-Frobenius error bound of Koltchinskii et al. is $nr\log(d + n)/M$ while our bound on
Frobenius error is, modulo logarithmic factors, the square root of this quantity. In this example,
the two results are essentially equivalent. For the second matrix, their bound deteriorates significantly
to $n^2 r\log(d+n)/M$ while our bound remains the same. Thus our algorithm is particularly
suited to handle matrices with non-uniform column norms.
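
The factor separating the two examples is the ratio $dn\|X\|_\infty^2 / \|X\|_F^2$, which multiplies the error term in the entry-wise bound. The short sketch below computes it for both matrices; the helper names are hypothetical and matrices are plain lists of rows.

```python
import math

def linf_to_frobenius_ratio(X):
    """Return d * n * max|X_ij|^2 / ||X||_F^2 for a matrix given as rows."""
    d, n = len(X), len(X[0])
    max_sq = max(x * x for row in X for x in row)
    fro_sq = sum(x * x for row in X for x in row)
    return d * n * max_sq / fro_sq

def uniform_matrix(d, n):
    """Every entry equals 1/sqrt(dn), so the Frobenius norm is 1."""
    return [[1 / math.sqrt(d * n)] * n for _ in range(d)]

def single_column_matrix(d, n):
    """One column has all entries 1/sqrt(d); the rest are zero."""
    return [[1 / math.sqrt(d) if j == 0 else 0.0 for j in range(n)]
            for _ in range(d)]
```

For the uniform matrix the ratio is 1, while for the single-column matrix it equals $n$, which is exactly the extra factor of $n$ in the deteriorated bound.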

Apart from interactive sampling, the difference between our procedure and the algorithm of
Koltchinskii et al. [119] is a matter of soft- versus hard-thresholding of the singular values of the
zero-filled matrix. In the setting of Proposition 2.7, soft thresholding seems more appropriate, as
the choice of regularization parameter allows one to trade off the amount of signal and noise captured
in $\hat{X}$. While in practice one could replace the hard thresholding step with soft thresholding
in our algorithm, there are some caveats with the theoretical analysis. First, soft-thresholding
does not ensure that $\hat{X}$ will be at most rank $r$, so it is not suitable for the matrix approximation
problem. Second, the resulting error guarantee depends on the sampling distribution, which
cannot be translated to the Frobenius norm unless the distribution is quite uniform [ , ].
Thus the soft-thresholding procedure does not give a Frobenius-norm error guarantee in the
non-uniform setting that we are most interested in.

The  majority  of  other  results  on  low  rank  matrix  completion  focus  on  parameter  recovery  rather 
than  approximation  [41,  117,  13  ].  It  is  therefore  best  to  compare  with  Proposition  2.7,  where 
we  show  that  Algorithm  3  consistently  recovers  the  parameter,  A.  These  results  exhibit  similar 
dependence between the number of samples and the problem parameters $n$, $r$, $\epsilon$ but hold under
different  notions  of  uniformity,  such  as  spikiness,  boundedness,  or  incoherence.  Our  result  agrees 
with  these  existing  results  but  holds  under  a  much  weaker  notion  of  uniformity. 

Lastly, we emphasize the effect of interactive sampling in our bound. We do not need any
uniformity assumption over the columns of the input matrix $X$. All existing works on noisy low
rank matrix completion or matrix approximation from missing data have some assumption of
this form, be it incoherence [ ,11], spikiness [134], or bounded $\ell_\infty$ norm [119]. The detailed
comparison with the result of Koltchinskii et al. gives a precise characterization of this effect
and shows that in the absence of such uniformity, our interactive sampling algorithm enjoys a
significantly lower sample complexity.

In  the  event  of  uniformity,  our  algorithm  performs  similarly  to  existing  ones.  Specifically,  we 
obtain  the  same  relationship  between  the  number  of  samples  M,  the  dimensions  n,  d  and  the 
target  rank  r.  If  we  knew  a  priori  that  the  matrix  had  uniform  column  lengths,  we  could  omit  the 
first  pass  of  the  algorithm,  sample  uniformly  in  the  second  pass  and  avoid  interactivity. 


2.4  Proofs 


In this section we provide detailed proofs of the results in this chapter. Some well known
large-deviation inequalities that are used throughout this thesis are stated in the appendix.


2.4.1  Proof  of  Theorem  2.1  and  Corollary  2.2 

Before turning to the proofs of Theorem 2.1 and Corollary 2.2, we prove several results involving
incoherence  and  the  concentration  of  orthogonal  projections  under  random  subsampling. 


Intermediary  Results  for  Theorem  2.1  and  Corollary  2.2 

This first intermediary result shows that the test statistic used in Algorithm 2 concentrates sharply
around its mean. Specifically, this theorem analyzes the test based on the projection
$\|x_\Omega - \mathcal{P}_{U_\Omega} x_\Omega\|_2^2$. The proof of this theorem uses various versions of Bernstein's inequality, and
improves on the result of Balzano et al. [25]. It is the key ingredient in the analysis of these
algorithms.

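Theorem 2.8. Let $U$ be an $r$-dimensional subspace of $\mathbb{R}^d$ and $y = x + v$ where $x \in U$ and
$v \in U^\perp$. Fix $\delta > 0$ and $m \ge \max\{\frac{8}{3} r\mu(U)\log(2r/\delta), 4\mu(v)\log(1/\delta)\}$ and let $\Omega$ be an index set
of $m$ entries sampled uniformly with replacement from $[d]$. With probability $\ge 1 - 4\delta$:

$$\frac{m(1-\alpha) - r\mu(U)\frac{\beta}{1-\gamma}}{d}\|v\|_2^2 \le \|y_\Omega - \mathcal{P}_{U_\Omega} y_\Omega\|_2^2 \le (1+\alpha)\frac{m}{d}\|v\|_2^2 \tag{2.15}$$

where $\alpha = \sqrt{\frac{2\mu(v)}{m}\log(1/\delta)} + \frac{2\mu(v)}{3m}\log(1/\delta)$, $\beta = \left(1 + 2\sqrt{\log(1/\delta)}\right)^2$, and $\gamma = \sqrt{\frac{8r\mu(U)}{3m}\log(2r/\delta)}$.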


This result showcases much stronger concentration of measure than the result of Balzano et
al. [25]. The main difference is in the definitions of $\alpha$ and $\beta$, which in their work have worse
dependence on the coherence parameter $\mu(v)$. These improvements play out into our stronger
sample complexity guarantee for the matrix and tensor completion algorithms.

The proof of this theorem is based on three deviation bounds controlling the effect of subsampling.
We state and prove these lemmas before turning to the proof of Theorem 2.8.

Lemma 2.9. With the same notation as in Theorem 2.8, with probability $\ge 1 - 2\delta$:

$$(1-\alpha)\frac{m}{d}\|v\|_2^2 \le \|v_\Omega\|_2^2 \le (1+\alpha)\frac{m}{d}\|v\|_2^2 \tag{2.16}$$


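Proof. The proof is an application of Bernstein's inequality (Theorem A.1). Let $\Omega(i)$ denote the
$i$th coordinate in the sample and let $X_i = v_{\Omega(i)}^2 - \frac{1}{d}\|v\|_2^2$ so that $\sum_{i=1}^m X_i = \|v_\Omega\|_2^2 - \frac{m}{d}\|v\|_2^2$. The
variance and absolute bounds are:

$$\sigma^2 = \sum_{i=1}^m \mathbb{E} X_i^2 \le \frac{m}{d}\sum_{i=1}^d v_i^4 \le \frac{m}{d}\|v\|_\infty^2\|v\|_2^2, \qquad R = \max_i |X_i| \le \|v\|_\infty^2.$$

Bernstein's Inequality then shows that:

$$\mathbb{P}\left(\left|\sum_{i=1}^m X_i\right| > t\right) \le 2\exp\left(\frac{-t^2}{2\|v\|_\infty^2\left(\frac{m}{d}\|v\|_2^2 + \frac{t}{3}\right)}\right).$$

Setting $t = \alpha\frac{m}{d}\|v\|_2^2$ and using the definition $\mu(v) = d\|v\|_\infty^2/\|v\|_2^2$ this bound becomes:

$$\mathbb{P}\left(\left|\|v_\Omega\|_2^2 - \frac{m}{d}\|v\|_2^2\right| > \alpha\frac{m}{d}\|v\|_2^2\right) \le 2\exp\left(\frac{-\alpha^2 m}{2\mu(v)(1 + \alpha/3)}\right).$$

And plugging in the definition of $\alpha$ ensures that the probability is upper bounded by $2\delta$.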
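
The role of the coherence $\mu(v)$ in Lemma 2.9 can be explored numerically. The sketch below is illustrative only: the function names are hypothetical, and the expression for $\alpha$ follows the square-root-plus-linear Bernstein form read from Theorem 2.8, so its exact constants should be treated as an assumption.

```python
import math
import random

def coherence(v):
    """mu(v) = d * ||v||_inf^2 / ||v||_2^2; equals 1 for a flat vector
    and d for a vector supported on a single coordinate."""
    d = len(v)
    return d * max(x * x for x in v) / sum(x * x for x in v)

def alpha(mu_v, m, delta):
    """Deviation level alpha, assuming the form
    sqrt(2 mu/m log(1/delta)) + 2 mu/(3m) log(1/delta)."""
    log_term = math.log(1 / delta)
    return math.sqrt(2 * mu_v / m * log_term) + 2 * mu_v / (3 * m) * log_term

def subsample_ratio(v, m, rng):
    """||v_Omega||^2 divided by its mean (m/d) ||v||^2, for m coordinates
    sampled uniformly with replacement."""
    d = len(v)
    sampled_sq = sum(v[rng.randrange(d)] ** 2 for _ in range(m))
    return sampled_sq / (m / d * sum(x * x for x in v))
```

A flat vector always yields a ratio of exactly 1, whereas a spiky vector has a much larger $\alpha$ for the same $m$, reflecting how rarely uniform sampling hits its few large coordinates.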
Lemma 2.10. With the same notation as Theorem 2.8 and provided that $m \ge 4\mu(v)\log(1/\delta)$,
with probability at least $1 - \delta$:

$$\|U_\Omega^T v_\Omega\|_2^2 \le \beta\,\frac{m}{d}\,\frac{r\mu(U)}{d}\,\|v\|_2^2 \tag{2.17}$$


Proof. The proof is an application of the vector version of Bernstein's inequality (Proposition A.2).
Let $u_i \in \mathbb{R}^r$ denote the $i$th row of an orthonormal basis for $U$ and set $X_i = u_{\Omega(i)} v_{\Omega(i)}$.
Since $v \in U^\perp$, the $X_i$s are centered so we are left to compute the variance:

$$\sum_{i=1}^m \mathbb{E}\|X_i\|_2^2 = \frac{m}{d}\sum_{j=1}^d \|u_j\|_2^2 v_j^2 \le \frac{m}{d}\,\frac{r\mu(U)}{d}\,\|v\|_2^2 \triangleq V.$$

Applying Proposition A.2 and re-arranging, we have that with probability at least $1 - \delta$:

$$\|U_\Omega^T v_\Omega\|_2 \le \sqrt{V} + \sqrt{4V\log(1/\delta)} = \|v\|_2\sqrt{\frac{m}{d}\,\frac{r\mu(U)}{d}}\left(1 + 2\sqrt{\log(1/\delta)}\right)$$

as long as:

$$t = \sqrt{4V\log(1/\delta)} \le V\left(\max_i \|X_i\|_2\right)^{-1}.$$

Since $\max_i \|X_i\|_2 \le \|v\|_\infty\sqrt{r\mu(U)/d}$ and using the incoherence assumption on $v$, this condition
translates to $m \ge 4\mu(v)\log(1/\delta)$. Squaring the above inequality proves the lemma.

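Lemma 2.11 ([ ]). Let $\delta > 0$ and $m \ge \frac{8}{3} r\mu(U)\log(2r/\delta)$. Then

$$\|(U_\Omega^T U_\Omega)^{-1}\|_2 \le \frac{d}{(1-\gamma)m} \tag{2.18}$$

with probability at least $1 - \delta$ provided that $\gamma < 1$. In particular $U_\Omega^T U_\Omega$ is invertible.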


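Proof of Theorem 2.8. We begin with the decomposition:

$$\|y_\Omega - \mathcal{P}_{U_\Omega} y_\Omega\|_2^2 = \|v_\Omega\|_2^2 - v_\Omega^T U_\Omega (U_\Omega^T U_\Omega)^{-1} U_\Omega^T v_\Omega. \tag{2.19}$$

Next, let $W_\Omega^T W_\Omega = (U_\Omega^T U_\Omega)^{-1}$, which is valid provided that $U_\Omega^T U_\Omega$ is invertible (which we will
subsequently ensure). We have:

$$v_\Omega^T U_\Omega (U_\Omega^T U_\Omega)^{-1} U_\Omega^T v_\Omega = \|W_\Omega U_\Omega^T v_\Omega\|_2^2 \le \|W_\Omega\|_2^2 \|U_\Omega^T v_\Omega\|_2^2 = \|(U_\Omega^T U_\Omega)^{-1}\|_2 \|U_\Omega^T v_\Omega\|_2^2$$

which means that:

$$\|v_\Omega\|_2^2 - \|(U_\Omega^T U_\Omega)^{-1}\|_2 \|U_\Omega^T v_\Omega\|_2^2 \le \|y_\Omega - \mathcal{P}_{U_\Omega} y_\Omega\|_2^2 \le \|v_\Omega\|_2^2. \tag{2.20}$$

The theorem now follows immediately from Lemmas 2.9, 2.10, and 2.11, which control the
quantities in the above inequalities.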
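
The decomposition used at the start of this proof can be checked directly in the simplest case. The sketch below restricts to a one-dimensional subspace spanned by a single vector `u` (an assumption made purely to keep the example free of matrix inversion), so that the inverse Gram matrix reduces to a scalar; the function name is hypothetical.

```python
def residual_two_ways(u, v, x_coef, idx):
    """For y = x_coef * u + v with v orthogonal to u, compute the squared
    residual of projecting y_Omega onto span(u_Omega) in two ways:
    directly, and through ||v_O||^2 - (u_O . v_O)^2 / (u_O . u_O)."""
    y = [x_coef * ui + vi for ui, vi in zip(u, v)]
    u_s = [u[i] for i in idx]
    v_s = [v[i] for i in idx]
    y_s = [y[i] for i in idx]
    uu = sum(a * a for a in u_s)  # scalar Gram matrix, assumed non-zero
    coef = sum(a * b for a, b in zip(u_s, y_s)) / uu
    direct = sum((b - coef * a) ** 2 for a, b in zip(u_s, y_s))
    uv = sum(a * b for a, b in zip(u_s, v_s))
    via_decomposition = sum(b * b for b in v_s) - uv * uv / uu
    return direct, via_decomposition
```

Both values agree, and both are at most $\|v_\Omega\|_2^2$, matching the upper inequality in the sandwich bound.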
Another significant component of the proof involves controlling the incoherence of various subspaces
that appear throughout the execution of the algorithm. The following lemmas control
precisely these quantities.

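Lemma 2.12. Let $U_1 \subset \mathbb{R}^{n_1}, U_2 \subset \mathbb{R}^{n_2}, \dots, U_T \subset \mathbb{R}^{n_T}$ be subspaces of dimension at most $d$, let
$W_1 \subset U_1$ have dimension $d'$. Define $\mathbb{S} = \mathrm{span}(\{\otimes_{t=1}^T u_i^{(t)}\}_{i=1}^d)$. Then:

(a) $\mu(W_1) \le \frac{\dim(U_1)}{d'}\mu(U_1)$.

(b) $\mu(\mathbb{S}) \le d^{T-1}\prod_{t=1}^T \mu(U_t)$.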


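Proof. For the first property, since $W_1$ is a subspace of $U_1$, $\mathcal{P}_{W_1} e_j = \mathcal{P}_{W_1}\mathcal{P}_{U_1} e_j$ so
$\|\mathcal{P}_{W_1} e_j\|_2^2 \le \|\mathcal{P}_{U_1} e_j\|_2^2$. The result now follows from the definition of incoherence.

For the second property, we instead compute the incoherence of:

$$\mathbb{S}' = \mathrm{span}\left(\{\otimes_{t=1}^T u^{(t)}\}_{u^{(t)} \in U_t\ \forall t}\right)$$

which clearly contains $\mathbb{S}$. Note that if $\{u_i^{(t)}\}$ is an orthonormal basis for $U_t$ (for each $t$), then the
outer products of all combinations of these vectors form a basis for $\mathbb{S}'$. Since the projection onto
$\mathbb{S}'$ factorizes across the modes, we now compute:

$$\mu(\mathbb{S}') = \frac{\prod_{t=1}^T n_t}{\prod_{t=1}^T \dim(U_t)} \max_{k_1 \in [n_1], \dots, k_T \in [n_T]} \|\mathcal{P}_{\mathbb{S}'}(\otimes_{t=1}^T e_{k_t})\|_2^2 = \prod_{t=1}^T \frac{n_t}{\dim(U_t)} \max_{k_t \in [n_t]} \|\mathcal{P}_{U_t} e_{k_t}\|_2^2 = \prod_{t=1}^T \mu(U_t).$$

Now, property (a) establishes that $\mu(\mathbb{S}) \le d^{T-1}\mu(\mathbb{S}')$ which is the desired result.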


Theorem 2.8, Lemma 2.12, and some algebraic manipulations yield the following corollary,
which we use in the analysis of Algorithm 2:

Corollary 2.13. Suppose that $\tilde{U}$ is a subspace of $U$ and $x_t \in U$ but $x_t \notin \tilde{U}$. Observe a set
of coordinates $\Omega \subset [d]$ of $m$ entries sampled uniformly at random with replacement. If
$m \ge 32 r\mu_0\log^2(2r/\delta)$ then with probability $\ge 1 - 4\delta$, $\|x_{t\Omega} - \mathcal{P}_{\tilde{U}_\Omega} x_{t\Omega}\|_2 > 0$. If $x_t \in \tilde{U}$, then
conditioned on the fact that $\tilde{U}_\Omega^T \tilde{U}_\Omega$ is invertible, $\|x_{t\Omega} - \mathcal{P}_{\tilde{U}_\Omega} x_{t\Omega}\|_2 = 0$ with probability 1.
Proof. The second statement follows from the fact that if $x_t \in \tilde{U}$, then $x_{t\Omega} \in \tilde{U}_\Omega$, so the projection
onto the orthogonal complement is identically zero. As for the first statement, we apply
Theorem 2.8, noting that the conditions on $m$ are satisfied.

We now verify that the lower bound is strictly positive. By Lemma 2.12(a), we know that any
vector $v$ in $U$ has coherence $\mu(v) \le r\mu_0$ and similarly any subspace $\tilde{U} \subset U$ has $\dim(\tilde{U})\mu(\tilde{U}) \le r\mu_0$.
Plugging $m$ into the definitions of $\alpha$ and $\gamma$, and using the previous facts, we see that $\alpha \le 1/2$
and $\gamma \le 1/3$. Writing $v \ne 0$ for the component of $x_t$ orthogonal to $\tilde{U}$, we are left with:

$$\|x_{t\Omega} - \mathcal{P}_{\tilde{U}_\Omega} x_{t\Omega}\|_2^2 \ge \frac{1}{d}\left(\frac{m}{2} - \frac{3r\mu_0\beta}{2}\right)\|v\|_2^2$$

and the lower bound is strictly positive whenever $3r\mu_0\beta < m$. Plugging in the definition of $\beta$, we
see that this relation is also satisfied, concluding the proof.


Proof  of  Corollary  2.2 

Corollary 2.2 is considerably simpler to prove than Theorem 2.1, so we prove the former in its
entirety before proceeding to the latter. First notice that our estimate $\tilde{U}$ for the column space is
always a subspace of the true column space, since we only ever add in fully observed vectors that
live in the column space. Also notice that we only resample the set $\Omega$ at most $r + 1$ times, since
the matrix is exactly rank $r$, and we only resample when we find a linearly independent column.
Thus with probability $1 - (r + 1)\delta$, by application of Lemma 2.11, all of the
matrices $\tilde{U}_\Omega^T \tilde{U}_\Omega$ are invertible.

When processing the $t$th column, one of two things can happen. If $x_t$ lives in our current
estimate for the column space, then we know from the above corollary that with probability 1,
$\|x_{t\Omega} - \mathcal{P}_{\tilde{U}_\Omega} x_{t\Omega}\|_2 = 0$. This holds since we have already accounted for the probability
that $\tilde{U}_\Omega^T \tilde{U}_\Omega$ is not invertible. When this happens we do not obtain additional samples and just
need to ensure that we reconstruct $x_t$, which we will see below. If $x_t$ does not live in $\tilde{U}$, then
with probability $\ge 1 - 4\delta$ the estimated projection is strictly positive (by Corollary 2.13), in
which case we fully observe the new direction $x_t$ and augment our subspace estimate. In fact,
this failure probability includes the event that $\tilde{U}_\Omega^T \tilde{U}_\Omega$ is not invertible.

Since $X$ has rank at most $r$, this latter case can happen no more than $r$ times, and via a union
bound, the failure probability is $\le 4r\delta + \delta$. Here, the last factor of $\delta$ ensures that the last
subsampled projection operator is well behaved. In other words, with probability $\ge 1 - 4r\delta - \delta$,
our estimate $\tilde{U}$ at the end of the algorithm is exactly the column space of $X$.

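The vectors that were not fully observed are recovered exactly as long as $\tilde{U}_\Omega^T \tilde{U}_\Omega$ is invertible.
This follows from the fact that, if $x_t \in \tilde{U}$, we can write $x_t = \tilde{U}\alpha_t$ and we have:

$$\hat{x}_t = \tilde{U}(\tilde{U}_\Omega^T \tilde{U}_\Omega)^{-1}\tilde{U}_\Omega^T \tilde{U}_\Omega \alpha_t = \tilde{U}\alpha_t = x_t$$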
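
The reconstruction step can be illustrated with exact rational arithmetic. The sketch below is a minimal example under stated assumptions: `solve` and `reconstruct` are hypothetical helper names, the basis is given as a list of rows and need not be orthonormal, and the subsampled Gram matrix is assumed to be invertible.

```python
from fractions import Fraction

def solve(A, b):
    """Solve the square system A z = b by Gauss-Jordan elimination,
    exactly, using Fractions. Assumes A is invertible."""
    k = len(A)
    M = [[Fraction(x) for x in A[i]] + [Fraction(b[i])] for i in range(k)]
    for c in range(k):
        p = next(i for i in range(c, k) if M[i][c] != 0)
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [x / pivot for x in M[c]]
        for i in range(k):
            if i != c:
                factor = M[i][c]
                M[i] = [a - factor * pc for a, pc in zip(M[i], M[c])]
    return [M[i][k] for i in range(k)]

def reconstruct(U, idx, x_obs):
    """Return U (U_O^T U_O)^{-1} U_O^T x_O, where U is a d x r basis
    (list of rows), idx the observed coordinates, x_obs the values there."""
    r = len(U[0])
    U_s = [U[i] for i in idx]
    gram = [[sum(row[a] * row[b] for row in U_s) for b in range(r)]
            for a in range(r)]
    rhs = [sum(row[a] * x for row, x in zip(U_s, x_obs)) for a in range(r)]
    coef = solve(gram, rhs)
    return [sum(c * u for c, u in zip(coef, row)) for row in U]
```

When the column lies in the span of the basis, observing as few as $r$ well-chosen coordinates is enough to recover every entry, which is why no further samples are needed for such columns.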
We already accounted for the probability that these matrices are invertible. We showed above
that the total failure probability is at most $5r\delta$ when $m \ge 32 r\mu_0\log^2(2r/\delta)$, so by setting
$m \ge 32 r\mu_0\log^2(10r^2/\delta)$, the total failure probability is at most $\delta$.

For the running time, per column, the dominating computational costs involve the projection $\mathcal{P}_{\tilde{U}_\Omega}$
and the reconstruction procedure. The projection involves several matrix multiplications and the
inversion of an $r \times r$ matrix, which need not be recomputed on every iteration. Ignoring the matrix
inversion, this procedure takes at most $O(mr)$ time per column, since the vector and the projector
are subsampled to $m$ dimensions, for a total running time of $O(nmr)$. At most $r$ times, we must
recompute $(\tilde{U}_\Omega^T \tilde{U}_\Omega)^{-1}$, which takes $O(r^2 m)$, contributing a factor of $O(r^3 m)$ to the total running
time. Finally, we run the Gram-Schmidt process once over the course of the algorithm, which
takes $O(dr^2)$ time.


Proof  of  Theorem  2.1 

We  now  generalize  the  above  proof  to  the  tensor  completion  case  and  prove  Theorem  2.1.  We 
first  focus  on  the  recovery  of  the  tensor  in  total,  expressing  this  in  terms  of  failure  probabilities  in 
the  recursion.  Then  we  inductively  bound  the  failure  probability  of  the  entire  algorithm.  Finally, 
we compute the total number of observations. For now, define $\tau_T$ to be the failure probability of
recovering a $T$-order tensor.

By Lemma 2.12, the subspace spanned by the mode-$T$ subtensors has incoherence at most $r^{T-2}\mu_0^{T-1}$
and rank at most $r$, and each slice has incoherence at most $r^{T-1}\mu_0^{T-1}$. To see this, note that the
subspace spanned by the mode-$T$ subtensors is based on the outer product of $T - 1$ subspaces, all
with coherence bounded by $\mu_0$ and dimension at most $r$, so Lemma 2.12(b) bounds its incoherence
by $r^{T-2}\mu_0^{T-1}$. Each slice is a 1-dimensional subspace of this $r$-dimensional subspace, so by
Lemma 2.12(a) it has incoherence that is a factor of $r$ larger.

By the same argument as Corollary 2.13, we see that with $m_{T-1} \ge 32 r^{T-1}\mu_0^{T-1}\log^2(2r/\delta_{T-1})$
the projection test succeeds in identifying informative subtensors (those not in our current basis)
with probability $\ge 1 - 4\delta_{T-1}$. With a union bound over these $r$ subtensors, the failure probability
becomes $\tau_T \le 4r\delta_{T-1} + \delta_{T-1}$, not counting the probability that we fail in recovering these
subtensors, which is $r\tau_{T-1}$.

For each order $T - 1$ tensor that we have to recover, the subspace of interest has incoherence
at most $r^{T-3}\mu_0^{T-2}$ and with probability $\ge 1 - 4r\delta_{T-2}$ we correctly identify each informative
subtensor as long as $m_{T-2} \ge 32 r^{T-2}\mu_0^{T-2}\log^2(2r/\delta_{T-2})$. Again the failure probability is at
most $\tau_{T-1} \le 4r\delta_{T-2} + \delta_{T-2} + r\tau_{T-2}$.

To compute the total failure probability we proceed inductively. $\tau_1 = 0$ since we completely
observe any one-mode tensor (vector). The recurrence relation is:

$$\tau_t = 4r\delta_{t-1} + \delta_{t-1} + r\tau_{t-1}. \tag{2.21}$$
In words, this means that we complete $r$ subtensors of order $T - 1$, $r^2$ tensors of order $T - 2$
and so on, observing $r^{T-1}$ order 1 tensors (or vectors) in full. The total failure probability is
therefore bounded by:

$$\tau_T \le \sum_{t=1}^{T-1} 5r^{T-t}\delta_t. \tag{2.22}$$
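
The step from the recurrence to the summed bound can be verified numerically. The following sketch unrolls the failure-probability recurrence with $\tau_1 = 0$ and compares it against the sum $\sum_{t=1}^{T-1} 5r^{T-t}\delta_t$; the function names and the example values of $\delta_t$ are illustrative assumptions.

```python
def failure_recurrence(r, deltas):
    """Unroll tau_t = 4 r delta_{t-1} + delta_{t-1} + r tau_{t-1} with
    tau_1 = 0. deltas[t-1] holds delta_t for t = 1..T-1; returns tau_T."""
    tau = 0.0
    for delta in deltas:
        tau = 4 * r * delta + delta + r * tau
    return tau

def failure_bound(r, deltas):
    """Summed bound: sum over t = 1..T-1 of 5 r^(T-t) delta_t."""
    T = len(deltas) + 1
    return sum(5 * r ** (T - t) * deltas[t - 1] for t in range(1, T))
```

The bound holds because each level contributes $(4r + 1)\delta_t \le 5r\delta_t$ and is then multiplied by $r$ once per remaining level of the recursion.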

The  requirement  on  rnt  is: 


mt  >  32rtnt0\og2(2rt/8t). 

To  achieve  risk  at  most  5,  one  can  set  mt  >  32Trt/dt0  log2(10rT/5),  which  concludes  the  proof 
of  the  statistical  guarantee  for  Algorithm  2. 

We also compute the sample complexity inductively. Let $\eta_T$ denote the number of samples
needed to complete an order $T$ tensor. Then $\eta_1 = n_1$ and:

$$\eta_t = n_t m_{t-1} + r\eta_{t-1}$$
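
Unrolling this recurrence gives $\eta_T = r^{T-1} n_1 + \sum_{t=2}^{T} n_t m_{t-1} r^{T-t}$, which the short sketch below checks against the recursive definition. The function names and the example dimensions are hypothetical.

```python
def samples_recursive(n, m, r):
    """eta_1 = n_1 and eta_t = n_t * m_{t-1} + r * eta_{t-1}.
    n[t-1] holds n_t and m[t-1] holds m_t for t = 1..T."""
    eta = n[0]
    for t in range(2, len(n) + 1):
        eta = n[t - 1] * m[t - 2] + r * eta
    return eta

def samples_unrolled(n, m, r):
    """Closed form: r^(T-1) n_1 + sum_{t=2}^{T} n_t m_{t-1} r^(T-t)."""
    T = len(n)
    return r ** (T - 1) * n[0] + sum(
        n[t - 1] * m[t - 2] * r ** (T - t) for t in range(2, T + 1)
    )
```

Each term $n_t m_{t-1} r^{T-t}$ counts the subsampled observations at level $t$, repeated once for each of the $r^{T-t}$ subproblems at that depth of the recursion.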


So  that  r]T  is  upper  bounded  as: 

rit  \  fT~Vo_1  log2(10rT/<5) 

when  we  set  mt  as  above. 

The running time is computed in a similar way to the matrix case. To complete an order $T$
tensor, we must complete $r$ order $T - 1$ tensors, and additionally process each subtensor. As in the
matrix case, processing all of the $n_T$ subtensors requires $m_{T-1} r$ time per column to do all vector
and matrix multiplications, $O(r^3 m_{T-1})$ time to do the matrix inversions, and $O(r^2 \prod_{t=1}^{T-1} n_t)$ to
perform Gram-Schmidt. If the running time to complete an order $t$ tensor is denoted $\kappa_t$, then the
running time is inductively defined as:


Vt  S 


<  '^2,ntmt_1rT  t  <  32 T  ( 


t= l 


.  t=i 


Kt  =  TKt- 1  +  O 


t—l 

ntmt-ir  +  r3mt-i  +  r2 

i=l 


Ui 


(2.23) 


with $\kappa_1 = n_1$. Using the fact that $m_t = \tilde{O}(r^t)$ and that $r \le \min_t\{n_t\}$, the total running time can
be bounded by:


6 


ntrT  +  Tr2+T  +  r2 


j.= i 


where we are treating $\mu_0$ as a constant and ignoring logarithmic factors.
2.4.2 Proof of Theorem 2.3


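The proof of the necessary condition in Theorem 2.3 is based on a standard reduction-to-testing
style argument. For ease of notation, we suppress the dependence on the parameters of $R^\star$, $\mathcal{Q}$,
and $\mathcal{X}$. The high-level architecture is to consider a subset $\mathcal{X}' \subset \mathcal{X}$ of inputs and lower bound the
Bayes risk. Specifically, if we fix a prior $\pi$ supported on $\mathcal{X}'$,

$$\begin{aligned}
R^\star &= \inf_{f \in \mathcal{F}} \inf_{q \in \mathcal{Q}} \max_{X \in \mathcal{X}} \mathbb{P}_{\Omega \sim q}\left[f(\Omega, X_\Omega) \ne X\right] \\
&\ge \inf_{f \in \mathcal{F}} \inf_{q \in \mathcal{Q}} \mathbb{E}_{\Omega \sim q}\mathbb{E}_{X \sim \pi}\left[\mathbb{P}_f\left[f(\Omega, X_\Omega) \ne X\right]\right] \\
&\ge \inf_{f \in \mathcal{F}} \min_{\Omega : |\Omega| = M} \mathbb{E}_{X \sim \pi}\left[\mathbb{P}_f\left[f(\Omega, X_\Omega) \ne X\right]\right]
\end{aligned}$$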

The first step is a standard one in information theoretic lower bounds and follows from the fact
that the maximum dominates any expectation over the same set. The second step is referred to
as Yao's Minimax Principle in the analysis of randomized algorithms, which says that one need
only consider deterministic algorithms if the input is randomized. It is easily verified by the fact
that in the second line, the inner expression is linear in $q$, so it is minimized on the boundary
of the simplex, which is a deterministic choice of $\Omega$. We use $\mathbb{P}_f$ to emphasize that $f$ can be
randomized, although it will suffice to consider deterministic $f$.
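
The linearity argument behind the second step can be illustrated on a toy instance. In the sketch below, `errors` maps each deterministic sampling pattern to its expected error under the prior; the labels and numbers are invented purely for illustration.

```python
def mixture_error(q, errors):
    """Expected error of a randomized sampling scheme: a q-weighted
    average of the errors of the deterministic patterns it mixes."""
    return sum(q[pattern] * errors[pattern] for pattern in errors)

def best_deterministic(errors):
    """Return the single pattern with the smallest error and that error."""
    pattern = min(errors, key=errors.get)
    return pattern, errors[pattern]
```

Because the mixture error is a weighted average, no distribution over patterns can do better than the best single pattern, which is why the infimum over $q$ may be replaced by a minimum over fixed index sets.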

Let $\pi$ be the uniform distribution over $\mathcal{X}' \subset \mathcal{X}$. The minimax risk is lower bounded by:

$$R^\star \ge 1 - \max_\Omega \mathbb{E}_{X \sim \pi}\left|\{X' \in \mathcal{X}' \mid X'_\Omega = X_\Omega\}\right|^{-1}$$

since if there is more than one matrix in $\mathcal{X}'$ that agrees with $X$ on $\Omega$, the best any estimator could
do is guess. Notice that since $X$ is drawn uniformly, this is equivalent to considering an $f$ that
deterministically picks one matrix $X' \in \mathcal{X}'$ that agrees with the observations.

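To upper bound the second term, define $\mathcal{U}_\Omega = \{X \in \mathcal{X}' : |\{X' \in \mathcal{X}' \mid X'_\Omega = X_\Omega\}| = 1\}$, which is
the set of matrices that are uniquely identified by the entries $\Omega$. Also set $\mathcal{N}_\Omega = \mathcal{X}' \setminus \mathcal{U}_\Omega$, which is
the set of matrices that are not uniquely identified by $\Omega$. We may write:

$$\max_\Omega \mathbb{E}_{X \sim \pi}\left|\{X' \in \mathcal{X}' \mid X'_\Omega = X_\Omega\}\right|^{-1} \le \max_\Omega \frac{|\mathcal{U}_\Omega| + \frac{1}{2}|\mathcal{N}_\Omega|}{|\mathcal{X}'|}$$

since if $X \in \mathcal{N}_\Omega$, there are at least two matrices that agree on those observations, so the best
estimator is correct with probability no more than $1/2$.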

We now turn to constructing a set $\mathcal{X}'$. Set $l = \frac{d}{r\mu_0}$. The left singular vectors $u_1, \ldots, u_{r-1}$ will be
constant on $\{1, \ldots, l\}$, $\{l+1, \ldots, 2l\}$, etc., while the first $r-1$ right singular vectors $v_1, \ldots, v_{r-1}$
will be the first $r-1$ standard basis elements. We are left with

$$d - (r-1)l = d - \frac{(r-1)d}{r\mu_0} = dc_1$$

coordinates where we will attempt to hide the last left singular vector. Here we defined
$c_1 = 1 - \frac{r-1}{r\mu_0}$, which is not a constant, but will ease the presentation. For $u_r$, we pick $l$ coordinates
out of the $dc_1$ remaining, pick a sign for each, and let $u_r$ have constant magnitude on those
coordinates. There are $2^l\binom{dc_1}{l}$ possible choices for this vector. The last right singular vector is
one of the $n-r$ remaining standard basis vectors. Notice that our choice of $l$ ensures that every
matrix in this family meets the column space incoherence condition.

To upper bound $|U_\Omega|$, notice that since $u_r$ can have both positive and negative signs, a matrix is
uniquely identified only if all of the entries corresponding to the last singular vector are observed.
Thus observations in the $t$th column only help to identify matrices whose last rank-one component was
hidden in that column. If we use $m_t$ observations on the $t$th column, we uniquely identify $2^l\binom{m_t}{l}$
matrices, where $\binom{m_t}{l} = 0$ if $m_t < l$. In total we have:

$$|\mathcal{X}'| = (n-r)\,2^l\binom{dc_1}{l} \quad \text{and} \quad |U_\Omega| = 2^l \sum_{t} \binom{m_t}{l}.$$

We are free to choose $m_t$ to maximize $|U_\Omega|$ subject to the constraints $m_t \le dc_1$ and $\sum_t m_t \le M$,
the total sensing budget. Optimizing over $m_t$ is a convex maximization problem with linear
constraints, and consequently the solution is on the boundary. By symmetry, this means that the best
sampling pattern is to observe columns in their entirety and devote the remaining observations
to one more column. With $M$ observations, we can observe $\frac{M}{dc_1}$ columns fully, leading to the
bounds:


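$$|U_\Omega| \le 2^l \left\lceil \frac{M}{dc_1} \right\rceil \binom{dc_1}{l} \quad \text{and} \quad \frac{|U_\Omega|}{|\mathcal{X}'|} \le \frac{\lceil M/(dc_1) \rceil}{n-r},$$

which, after plugging in for $c_1$, leads to the lower bound on the risk.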
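
As a sanity check on the boundary argument, the following minimal Python sketch compares two ways of spending the budget $M$: filling columns completely versus spreading observations evenly. The function names and the example sizes are illustrative assumptions and are not part of the proof; the common factor $2^l$ cancels in the ratio $|U_\Omega| / |\mathcal{X}'|$ and is omitted.

```python
from math import comb


def identified_fraction(budgets, l, dc1):
    """Ratio |U_Omega| / |X'| for per-column budgets m_t.

    Assumes one budget entry per candidate column (n - r of them).
    comb(m_t, l) is 0 whenever m_t < l, matching the convention in the text.
    """
    return sum(comb(m_t, l) for m_t in budgets) / (len(budgets) * comb(dc1, l))


def full_columns(M, num_cols, dc1):
    """Boundary solution: observe columns completely, one after another."""
    budgets, remaining = [], M
    for _ in range(num_cols):
        take = min(dc1, remaining)
        budgets.append(take)
        remaining -= take
    return budgets


def even_spread(M, num_cols):
    """Interior alternative: split the budget as evenly as possible."""
    base, extra = divmod(M, num_cols)
    return [base + (1 if t < extra else 0) for t in range(num_cols)]


# Hypothetical sizes: 20 candidate columns, dc1 = 30 free coordinates, l = 5.
l, dc1, num_cols, M = 5, 30, 20, 150
concentrated = identified_fraction(full_columns(M, num_cols, dc1), l, dc1)
spread = identified_fraction(even_spread(M, num_cols), l, dc1)
print(concentrated, spread)
```

With these sizes the concentrated pattern identifies a quarter of the family (five fully observed columns out of twenty), while the even split identifies almost none of it, which is the behavior the convexity argument predicts.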


2.4.3  Proof  of  Theorem  2.4 

We start by giving a proof in the matrix case, which is a minor correction of the proof by Candès
and Tao [13]. Then we turn to the tensor case, where only small adjustments are needed to
establish the result. We work in the Bernoulli model, noting that Candès and Tao's arguments
demonstrate how to adapt these results to the uniform-at-random sampling model.


Matrix  Case 

In the matrix case, suppose that $l_1 = \frac{n_1}{r}$ and $l_2 = \frac{n_2}{\mu_0 r}$ are both integers. Define the following
blocks $R_1, \ldots, R_r \subset [n_1]$ and $C_1, \ldots, C_r \subset [n_2]$ as:

$$R_i = \{l_1(i-1)+1,\; l_1(i-1)+2,\; \ldots,\; l_1 i\}$$

$$C_i = \{l_2(i-1)+1,\; l_2(i-1)+2,\; \ldots,\; l_2 i\}$$

Now consider the family of $n_1 \times n_2$ matrices defined by:

$$\mathcal{M} = \left\{ \sum_{k=1}^{r} u_k v_k^T \;\middle|\; u_k \in [1, \sqrt{\mu_0}]^{n_1} \circ \mathbf{1}_{R_k},\; v_k = \mathbf{1}_{C_k} \right\}. \qquad (2.24)$$


The $\circ$ operator is the Hadamard operator, which performs entry-wise multiplication. $\mathcal{M}$ is a
family of block-diagonal matrices where the blocks have size $l_1 \times l_2$. Each block has constant
rows, but each row may take any value in $[1, \sqrt{\mu_0}]$. For any $M \in \mathcal{M}$, the incoherence of the
column space can be computed as:

$$\mu(U) = \frac{n_1}{r} \max_{j \in [n_1]} \|P_U e_j\|_2^2 = \frac{n_1}{r} \max_{k \in [r]} \max_{j \in [n_1]} \frac{(u_k^T e_j)^2}{u_k^T u_k} \le \frac{n_1}{r} \max_{k \in [r]} \frac{\mu_0}{n_1/r} = \mu_0.$$

A similar calculation reveals that the row space is also incoherent with parameter $\mu_0$.
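
The incoherence computation can be checked numerically. The sketch below is a minimal illustration, assuming the vectors $u_k$ have disjoint supports $R_k$ as in (2.24), so that $\|P_U e_j\|_2^2 = u_k(j)^2 / \|u_k\|_2^2$ for the block containing $j$. The helper names and the sizes are hypothetical.

```python
import random


def column_incoherence(us, n1):
    """mu(U) = (n1 / r) * max_j ||P_U e_j||^2 for disjointly supported u_k."""
    r = len(us)
    worst = 0.0
    for u in us:
        norm_sq = sum(x * x for x in u)
        worst = max(worst, max(x * x for x in u) / norm_sq)
    return n1 / r * worst


def sample_family_member(n1, r, mu0, rng):
    """Draw u_1..u_r with entries in [1, sqrt(mu0)] on block R_k, 0 elsewhere."""
    l1 = n1 // r
    us = []
    for k in range(r):
        u = [0.0] * n1
        for j in range(k * l1, (k + 1) * l1):
            u[j] = rng.uniform(1.0, mu0 ** 0.5)
        us.append(u)
    return us


rng = random.Random(0)
n1, r, mu0 = 60, 4, 3.0
mu = column_incoherence(sample_family_member(n1, r, mu0, rng), n1)
print(mu, "<=", mu0)
```

Because every nonzero entry has squared magnitude at most $\mu_0$ and every $u_k$ has squared norm at least $l_1 = n_1/r$, the computed value never exceeds $\mu_0$, whichever member of the family is drawn.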

Unique identification of $M$ is not possible unless we observe at least one entry from each row
of each diagonal block. If we did not, then we could vary the corresponding coordinate in the
appropriate $u_k$ and find infinitely many matrices $M' \in \mathcal{M}$ that agree with our observations and have
rank and incoherence at most $r$ and $\mu_0$ respectively. Thus, the probability of successful recovery
is no larger than the probability of observing one entry of each row of each diagonal block.

The probability that any row of any block is unsampled is $\pi_1 = (1-p)^{l_2}$ and the probability that
all rows are sampled is $(1-\pi_1)^{n_1}$. This must upper bound the success probability $1-\delta$. Thus:

$$-n_1\pi_1 \ge n_1 \log(1-\pi_1) \ge \log(1-\delta) \ge -2\delta,$$

or $\pi_1 \le 2\delta/n_1$ as long as $\delta < 1/2$. Substituting $\pi_1 = (1-p)^{l_2}$ we obtain:

$$\log(1-p) \le \frac{1}{l_2} \log\left(\frac{2\delta}{n_1}\right) = \frac{\mu_0 r}{n_2} \log\left(\frac{2\delta}{n_1}\right)$$

as a necessary condition for unique identification of $\mathcal{M}$.

Exponentiating both sides, writing $p = \frac{m}{n_1 n_2}$, and using the fact that $1 - e^{-x} > x - x^2/2$ gives us:

$$m \ge n_1 \mu_0 r \log\left(\frac{n_1}{2\delta}\right) (1 - \epsilon/2)$$

when $\frac{\mu_0 r}{n_2} \log\left(\frac{n_1}{2\delta}\right) \le \epsilon < 1$.
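
The chain of inequalities above can be exercised numerically. The following minimal sketch, with hypothetical function names and problem sizes, computes the smallest Bernoulli rate $p$ at which every row of every diagonal block is sampled with probability $1 - \delta$, and compares the implied expected sample count $p\,n_1 n_2$ with the stated lower bound.

```python
import math


def prob_all_rows_sampled(p, n1, l2):
    """(1 - pi_1)^{n1}, where pi_1 = (1 - p)^{l2} is the chance a row is missed."""
    pi1 = (1.0 - p) ** l2
    return (1.0 - pi1) ** n1


def sample_lower_bound(n1, n2, r, mu0, delta, eps):
    """n1 * mu0 * r * log(n1 / (2 delta)) * (1 - eps / 2), valid when x <= eps < 1."""
    x = mu0 * r / n2 * math.log(n1 / (2 * delta))
    if not (x <= eps < 1):
        raise ValueError("need mu0 * r / n2 * log(n1 / (2 delta)) <= eps < 1")
    return n1 * mu0 * r * math.log(n1 / (2 * delta)) * (1 - eps / 2)


def smallest_feasible_p(n1, l2, delta):
    """Smallest p for which prob_all_rows_sampled(p, n1, l2) reaches 1 - delta."""
    pi1 = 1.0 - (1.0 - delta) ** (1.0 / n1)
    return 1.0 - pi1 ** (1.0 / l2)


# Hypothetical problem sizes, chosen so that l2 = n2 / (mu0 * r) is an integer.
n1 = n2 = 1000
r, mu0, delta, eps = 5, 2, 0.1, 0.1
l2 = n2 // (mu0 * r)
p_star = smallest_feasible_p(n1, l2, delta)
m_star = p_star * n1 * n2
bound = sample_lower_bound(n1, n2, r, mu0, delta, eps)
print(m_star, bound)
```

In this example the expected number of samples at the smallest feasible rate exceeds the bound, and halving the bound drives the probability of touching every row far below $1 - \delta$, as the argument requires.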


Tensor  Case 

Fix $T$, the order of the tensor, and suppose that $l_1 = \frac{n_1}{r}$ is an integer. Moreover, suppose that
$l_t = \frac{n_t}{\mu_0 r}$ is an integer for $1 < t \le T$.

Define a set of blocks, one for each mode, and the family

$$B_i^{(t)} = \{l_t(i-1)+1,\; l_t(i-1)+2,\; \ldots,\; l_t i\} \quad \forall i \in [r],\; t \in [T]$$

$$\mathcal{M} = \left\{ \sum_{i=1}^{r} \bigotimes_{t=1}^{T} a_i^{(t)} \;\middle|\; a_i^{(1)} \in [1, \sqrt{\mu_0}]^{n_1} \circ \mathbf{1}_{B_i^{(1)}},\; a_i^{(t)} = \mathbf{1}_{B_i^{(t)}},\; 1 < t \le T \right\}$$

This is a family of block-diagonal tensors and, just as before, straightforward calculations reveal
that each subspace is incoherent with parameter $\mu_0$. Again, unique identification is not possible
unless we observe at least one entry from each row of each diagonal block. The difference is that
in the tensor case, there are $\prod_{t \ne 1} l_t$ entries per row of each diagonal block, so the probability that
any single row is unsampled is $\pi_1 = (1-p)^{\prod_{t \ne 1} l_t}$. Again there are $n_1$ rows and any algorithm
that succeeds with probability $1-\delta$ must satisfy:

$$-n_1\pi_1 \ge n_1 \log(1-\pi_1) \ge \log(1-\delta) \ge -2\delta,$$

which implies $\pi_1 \le 2\delta/n_1$ (assuming $\delta < 1/2$). Substituting in the definition of $\pi_1$ we have:

$$\log(1-p) \le \frac{1}{\prod_{t \ne 1} l_t} \log\left(\frac{2\delta}{n_1}\right) = \frac{\mu_0^{T-1} r^{T-1}}{\prod_{t \ne 1} n_t} \log\left(\frac{2\delta}{n_1}\right).$$

The same approximations as before yield the bound (as long as $\frac{\mu_0^{T-1} r^{T-1}}{\prod_{t \ne 1} n_t} \log\left(\frac{n_1}{2\delta}\right) \le \epsilon < 1$):

$$m \ge n_1 \mu_0^{T-1} r^{T-1} \log\left(\frac{n_1}{2\delta}\right) (1 - \epsilon/2).$$


2.4.4  Proof  of  Theorem  2.5  and  related  propositions 

To prove the main approximation theorem (Theorem 2.5), we must analyze the three phases of
the algorithm. The analysis of the first phase is fairly straightforward; we show that under the
incoherence assumption, one can compute a reliable estimate of each column norm from a very
small number of measurements per column. For the second phase, we show that by sampling
according to the re-weighted distribution using the column-norm estimates, the matrix $\tilde{X}$ is close
to $X$ in spectral norm. We then translate this spectral norm guarantee into an approximation
guarantee for $\hat{X} = \tilde{X}_r$.

Let us start with this translation. We use a lemma of Achlioptas and McSherry [1].

Lemma 2.14 ([1]). Let $A$ and $N$ be any matrices and write $\hat{A} = A + N$. Then:

$$\|A - \hat{A}_k\|_2 \le \|A - A_k\|_2 + 2\|N_k\|_2$$

$$\|A - \hat{A}_k\|_F \le \|A - A_k\|_F + \|N_k\|_F + 2\sqrt{\|N_k\|_F \|A_k\|_F}$$
The lemma states that if $\hat{A} - A$ is small, then the top $k$ ranks of $\hat{A}$ are nearly as good an
approximation to $A$ as the top $k$ ranks of $A$ itself. Notice that all of the error terms only depend on
rank-$k$ matrices. We will use this lemma with $X$ and $\tilde{X}$, and of course with the target rank as $r$.
We will soon show that $\|X - \tilde{X}\|_2 \le \epsilon\|X\|_F$, which implies:

$$
\begin{aligned}
\|X - \hat{X}\|_F &\le \|X - X_r\|_F + \|(X - \tilde{X})_r\|_F + 2\sqrt{\|(X - \tilde{X})_r\|_F \|X_r\|_F} \\
&\le \|X - X_r\|_F + \sqrt{r}\,\|X - \tilde{X}\|_2 + 2\sqrt{\sqrt{r}\,\|X - \tilde{X}\|_2 \|X\|_F} \\
&\le \|X - X_r\|_F + \|X\|_F \left(\sqrt{r}\,\epsilon + 2 r^{1/4} \epsilon^{1/2}\right) \qquad (2.25)
\end{aligned}
$$

So if we can obtain a bound on $\|X - \tilde{X}\|_2$ of that form, we will have proved the theorem.
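
To get a feel for the additive term in (2.25), the short sketch below evaluates the coefficient of $\|X\|_F$ and inverts it by bisection to find how small the spectral error $\epsilon$ must be for a desired additive error. The function names and the numbers are illustrative assumptions only.

```python
import math


def additive_error_factor(r, eps):
    """Coefficient of ||X||_F in (2.25): sqrt(r) * eps + 2 * r^(1/4) * eps^(1/2)."""
    return math.sqrt(r) * eps + 2 * r ** 0.25 * math.sqrt(eps)


def spectral_accuracy_needed(r, target, tol=1e-12):
    """Largest eps in [0, 1] whose additive factor stays below target (bisection).

    Relies on additive_error_factor being increasing in eps.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if additive_error_factor(r, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo


print(additive_error_factor(16, 0.01))
print(spectral_accuracy_needed(16, 0.44))
```

For small $\epsilon$ the square-root term dominates, so achieving additive error $\gamma\|X\|_F$ requires a spectral accuracy of roughly $\epsilon \approx \gamma^2 / (4\sqrt{r})$.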

As for Propositions 2.6 and 2.7, the translation uses the first inequality of Achlioptas and
McSherry [1]. If $X$ is rank $r$, the matrix $X - \hat{X}$ has rank at most $2r$, which means that:

$$\|X - \hat{X}\|_F \le \sqrt{2r}\,\|X - \hat{X}\|_2 \le 2\sqrt{2r}\,\|X - \tilde{X}\|_2 \le 2\sqrt{2r}\,\epsilon\|X\|_F.$$

For the second proposition, we first bound $\|\hat{X} - M\|_2$ and then use the same argument.

$$
\begin{aligned}
\|\hat{X} - M\|_2 &\le \|\hat{X} - X\|_2 + \|R\|_2 \le \|X - X_r\|_2 + 2\epsilon\|X\|_F + \|R\|_2 \\
&\le 2\|R\|_2 + 2\epsilon\left(\|M\|_F + \|R_\Omega\|_F\right).
\end{aligned}
$$

To arrive at the second line, we use the fact that $X_r$ is the best rank $r$ approximation to $X$, so
$\|X - X_r\|_2 \le \|X - M\|_2 = \|R\|_2$. We also use the triangle inequality on the term $\|X\|_F$, together with
the fact that since the algorithm never looked at $X$ on $\Omega^c$ it is fair to set $R_{\Omega^c} = 0$.


Let us now turn to the first phase. In our analysis of Algorithm 1, we proved that the norm of
an incoherent vector can be approximated by subsampling. Specifically, Lemma 2.9 shows that
with high probability, the estimates $\hat{c}_t$, once appropriately rescaled, are trapped between $(1-\alpha)c_t$
and $(1+\alpha)c_t$ where $\alpha = \sqrt{\frac{2\mu}{m_1}\log(n/\delta)} + \frac{2\mu}{3m_1}\log(n/\delta)$. The same is of course true for $\hat{f}$.

Setting $m_1 \ge 32\mu\log(n/\delta)$ we find that $\alpha \le 1/2$, meaning that by using in total $32n\mu\log(n/\delta)$
samples in the first phase, we approximate the target sampling distribution to within a
multiplicative factor of 1/2 with probability $\ge 1-\delta$.
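
The role of incoherence in the first phase can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: coordinates are drawn uniformly with replacement and the estimate is rescaled by $d/m$; the function name and the two test vectors are hypothetical.

```python
import random


def estimate_sq_norm(x, m, rng):
    """Estimate ||x||_2^2 from m coordinates drawn uniformly with replacement."""
    d = len(x)
    sample = [x[rng.randrange(d)] for _ in range(m)]
    return d / m * sum(v * v for v in sample)


rng = random.Random(0)
d, m = 1000, 50
flat = [rng.choice((-1.0, 1.0)) for _ in range(d)]  # incoherent: all entries equal in magnitude
spiky = [0.0] * d
spiky[0] = d ** 0.5  # coherent: same norm, all energy on one coordinate

print(estimate_sq_norm(flat, m, rng))   # exactly d for the flat vector
print(estimate_sq_norm(spiky, m, rng))  # 0 unless coordinate 0 happens to be drawn
```

For the flat vector every sample carries the same information, so the estimate is exact for any $m$; for the spiky vector the estimate is either zero or a gross overestimate, which is why the guarantee is stated in terms of the coherence parameter.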


For the second pass, we show that $\tilde{X}$ is close to $X$ in spectral norm. We use the following lemma:

Lemma 2.15. Provided that $(1-\alpha)c_t \le \hat{c}_t \le (1+\alpha)c_t$ and $(1-\alpha)f \le \hat{f} \le (1+\alpha)f$,
with probability $\ge 1-\delta$:


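$$\|\tilde{X} - X\|_2 \le \|X\|_F \sqrt{\frac{1+\alpha}{1-\alpha}} \left( \sqrt{\frac{4}{m_2} \max\left(\frac{d}{n}, \mu\right) \log\left(\frac{d+n}{\delta}\right)} + \frac{4}{3} \sqrt{\frac{d\mu}{m_2 n}} \log\left(\frac{d+n}{\delta}\right) \right)$$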


Proof. Under the uniform-at-random sampling model, we will apply the non-commutative
Bernstein inequality (Proposition A.4) to bound $\|\tilde{X} - X\|_2$. Recall that for each column $x_t$, we
observe a set of $m_{2,t} = m_2 n \hat{c}_t / \hat{f}$ observations and form the zero-filled vector $y_t$ defined by:

$$y_t = \frac{d}{m_{2,t}} \sum_{s=1}^{m_{2,t}} x_t(i_s)\, e_{i_s},$$

where $\{i_s\}_{s=1}^{m_{2,t}}$ are the observations. Since the set of observations is sampled with replacement
(although duplicates in each half of the sample are thrown out), each entry of $y_t$ is observed with
probability $m_{2,t}/d$, so $y_t$ is an unbiased estimate of $x_t$. So we will apply the rectangular Matrix
Bernstein inequality to $y_t e_t^T - x_t e_t^T$. Moreover:


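$$\|y_t e_t^T - x_t e_t^T\| \le \|y_t\| \|e_t\| + \|x_t\| \le \left(1 + \sqrt{\frac{d\mu}{m_{2,t}}}\right) \|x_t\| \le 2\sqrt{\frac{d\mu}{m_{2,t}}}\, \|x_t\|,$$

which follows by the triangle inequality, Cauchy-Schwarz and the chain of inequalities:

$$\|y_t\|_2 \le \sqrt{m_{2,t}}\, \|y_t\|_\infty \le \frac{d}{\sqrt{m_{2,t}}} \|x_t\|_\infty \le \sqrt{\frac{d\mu}{m_{2,t}}}\, \|x_t\|_2.$$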


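When we plug in for $m_{2,t}$ we get:

$$\|y_t e_t^T - x_t e_t^T\| \le 2\sqrt{\frac{d\mu}{m_2 n} \frac{c_t}{\hat{c}_t} \hat{f}} \le 2\|X\|_F \sqrt{\frac{d\mu}{m_2 n} \frac{1+\alpha}{1-\alpha}},$$

where $\alpha$ is the error bound from the first phase of the algorithm.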
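
The unbiasedness of the zero-filled vector $y_t$ can be verified exactly on a small example by averaging over every possible draw of indices. The sketch below is a simplification that keeps repeated indices (the procedure described above discards duplicates within each half of the sample); the function names and the vector are hypothetical.

```python
from itertools import product


def zero_filled(x, indices):
    """y = (d / m) * sum_s x[i_s] * e_{i_s}, keeping repeated indices."""
    d, m = len(x), len(indices)
    y = [0.0] * d
    for i in indices:
        y[i] += d / m * x[i]
    return y


def exact_mean(x, m):
    """Average of zero_filled over all d^m equally likely index tuples."""
    d = len(x)
    total = [0.0] * d
    for idx in product(range(d), repeat=m):
        y = zero_filled(x, idx)
        total = [a + b for a, b in zip(total, y)]
    return [t / d ** m for t in total]


x = [3.0, -1.0, 0.5, 2.0]
print(exact_mean(x, 2))
```

Each coordinate is drawn $m/d$ times in expectation and rescaled by $d/m$, so the average over all draws recovers $x$ coordinate by coordinate.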


As for the variance terms in Proposition A.4, both turn out to be quite small as we will soon see.
For the first term:

The first equality is straightforward while the second follows from linearity of expectation and
the fact that each coordinate of $y_t$ is non-zero with probability $m_{2,t}/d$. The third line follows
from the fact that applying the sum leads to an $n \times n$ diagonal matrix with $\|x_t\|_2^2$ on the
diagonal. When we use our definition of $m_{2,t}$ this becomes:

For the second term, we have:

Here the first equality is trivial while the second one uses the fact that the off-diagonals of $y_t y_t^T$ are
unbiased for $x_t x_t^T$ and hence we are left with a diagonal matrix. To arrive at the second line we
note that the spectral norm of a diagonal matrix is simply the largest diagonal entry. Then we apply
the incoherence assumption and finally our sampling distribution.

At this point we may apply the inequality, which states that with probability $\ge 1-\delta$:

The interactive sampling procedure has a dramatic effect on the bound in Lemma 2.15. If one
sampled uniformly across the columns, then both terms would grow with the squared norm of the
largest column rather than with the average squared norm, which gives a much weaker bound when the
matrix energy is concentrated on a few columns. This is precisely when the row space is coherent.
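
A small numerical example makes the contrast concrete. The sketch below assumes exact norm estimates ($\alpha = 0$), real-valued budgets $m_{2,t} = m_2 n c_t / f$ with $c_t = \|x_t\|_2^2$ and $f = \|X\|_F^2$, and ignores rounding; it compares the per-column range term $2\sqrt{d\mu/m_{2,t}}\,\|x_t\|$ under norm-proportional and uniform budgets. All names and numbers are hypothetical.

```python
import math


def column_budgets(c_hat, m2):
    """m_{2,t} = m2 * n * c_hat[t] / f_hat, with f_hat = sum(c_hat)."""
    n = len(c_hat)
    f_hat = sum(c_hat)
    return [m2 * n * c / f_hat for c in c_hat]


def range_bound(sq_norms, budgets, d, mu):
    """max_t 2 * sqrt(d * mu / m_{2,t}) * ||x_t||, the range term fed to Bernstein."""
    return max(
        2 * math.sqrt(d * mu / m_t) * math.sqrt(c)
        for c, m_t in zip(sq_norms, budgets)
    )


sq_norms = [9.0, 1.0, 1.0, 1.0]  # one heavy column, three light ones
d, mu, m2 = 100, 1.0, 30
interactive = column_budgets(sq_norms, m2)
uniform = [float(m2)] * len(sq_norms)
print(interactive, range_bound(sq_norms, interactive, d, mu))
print(uniform, range_bound(sq_norms, uniform, d, mu))
```

Both allocations use $m_2 n$ samples in total, but the norm-proportional budgets equalize the range term across columns at $2\sqrt{d\mu f/(m_2 n)}$, whereas the uniform budgets leave it governed by the heaviest column.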

To wrap up, recall that $1 \le \mu \le d$ and $n \ge d$. Setting $m_1 \ge 32\mu\log(n/\delta)$ so that $\alpha \le 1/2$, the
bound in Lemma 2.15 is dominated by:

■*-*■■■ s  ,™,'5  vS'«  (t2)- 

Returning to Equation 2.25 we can now substitute in for $\epsilon$ and conclude the proof.


2.5  Empirical  Results 


We perform a number of simulations to analyze the empirical performance of both Algorithms 1
and 3. The first set of simulations, in Figures 2.1 and 2.2, examines the behavior of Algorithm 1.
We work with square matrices where the column space is spanned by binary vectors, constructed
so that the matrix has the appropriate rank and coherence. The row space is spanned by either
random Gaussian vectors in the case of an incoherent row space or a random collection of standard
basis elements if we want high coherence.

In the first two figures (2.1(a) and 2.1(b)) we study the algorithm's dependence on the matrix
dimension. For various matrix sizes, we record the probability of exact recovery as we vary the
number of samples allotted to the algorithm. We plot the probability of recovery as a function
of the fraction of samples per column, denoted by $p$ (Figure 2.1(a)), and as a function of the total
samples per column $m$ (Figure 2.1(b)). It is clear from the simulations that $p$ can decrease with
matrix dimension while still ensuring exact recovery. On the other hand, the curves in the second
figure line up, demonstrating that the number of samples per column remains fixed for fixed
probability of recovery. This behavior is predicted by Corollary 2.2, which shows that the total
number of samples scales linearly with dimension, so that the number of samples per column
remains constant.

Figure 2.1: (a): Probability of success of Algorithm 1 versus fraction of samples per column
($p = m/d$) with $r = 10$, $\mu_0 = 1$. (b): Data from (a) plotted against samples per column, $m$. (c):
Probability of success of Algorithm 1 versus fraction of samples per column ($p = m/d$) with
$n = 500$, $\mu_0 = 1$. (d): Data from (c) plotted against rescaled sample probability $p/(r \log r)$.

Figure 2.2: (a): Probability of success of Algorithm 1 versus fraction of samples per column
($p = m/d$) with $n = 500$, $r = 10$. (b): Data from (a) plotted against rescaled sampling probability
$p/\mu_0$. (c): Probability of success of SVT versus rescaled sampling probability $np/\log(n)$ with
$r = 5$, $\mu_0 = 1$. (d): Probability of success of Algorithm 1 and SVT versus sampling probability
for matrices with highly coherent row space with $r = 5$, $n = 100$.

In Figures 2.1(c) and 2.1(d) we show the results of a similar simulation, instead varying the
matrix rank $r$, with dimension fixed at 500. The first figure shows that the fraction of samples per
column must increase with rank to ensure successful recovery, while the second shows that the ratio
$p/(r \log r)$ governs the probability of success. Figures 2.2(a) and 2.2(b) similarly confirm a linear
dependence between the incoherence parameter $\mu_0$ and the sample complexity. Notice that the
empirical dependence on rank is actually better than what is predicted by Corollary 2.2, which
suggests that $r \log^2 r$ is the appropriate scaling. Our theorem does seem to capture the correct
dependence on the coherence parameter.

In the last two plots we compare Algorithm 1 against the Singular Value Thresholding algorithm
(SVT) of Cai et al. [40]. The SVT algorithm is a non-interactive iterative algorithm for nuclear
norm minimization from a set of uniform-at-random observations. In Figure 2.2(c), we show that
the success probability is governed by $np/\log(n)$, which is predicted by the existing analysis of
the nuclear norm minimization program. This dependence is worse than for Algorithm 1, whose
success probability is governed by $np$ as demonstrated in Figure 2.1(b). Finally, in Figure 2.2(d),
we record success probability versus sample complexity on matrices with maximally coherent
row spaces. The simulation shows that our algorithm can tolerate coherent row spaces while the
SVT algorithm cannot.

Figure 2.3: (a): An example matrix with highly non-uniform column norms and (b) the
sampling pattern of Algorithm 3. (c): Relative error as a function of sampling probability $p$ for
different target rank $r$ ($\mu = 1$). (d): The same data where the y-axis is instead $\epsilon/\sqrt{r}$.

For Algorithm 3, we display the results of a similar set of simulations in Figures 2.3 and 2.4.
Here, we construct low rank matrices whose column spaces are spanned by binary vectors and
whose columns are also constant in magnitude on their support. The length of the columns is
distributed either log-normally, resulting in non-uniform column lengths, or uniformly between
0.9 and 1.1. We then corrupt this low rank matrix by adding a Gaussian matrix whose entries have
a fixed variance. In Figure 2.3(a) we show a matrix constructed via this process and in Figure 2.3(b)
we show the set of entries sampled by Algorithm 3 on this input. From the plots, it is clear that
the algorithm focuses its measurements on the columns with high energy, while using very few
samples to capture the columns with lower energy.
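
One simple way to generate test matrices of this kind is sketched below. It is an illustrative assumption rather than the exact construction used in the experiments: each column is supported on one of $r$ disjoint binary blocks, is constant on its support, and has a length drawn either from a standard log-normal distribution or uniformly from $[0.9, 1.1]$. The noise level `sigma` is a free parameter and all names are hypothetical.

```python
import math
import random


def make_columns(d, n, r, lognormal, rng):
    """Return n columns in R^d, each constant on one of r disjoint blocks."""
    block = d // r
    columns, lengths = [], []
    for _ in range(n):
        k = rng.randrange(r)  # which binary basis vector supports this column
        if lognormal:
            length = rng.lognormvariate(0.0, 1.0)  # non-uniform column lengths
        else:
            length = rng.uniform(0.9, 1.1)  # nearly uniform column lengths
        value = length / math.sqrt(block)
        columns.append(
            [value if k * block <= i < (k + 1) * block else 0.0 for i in range(d)]
        )
        lengths.append(length)
    return columns, lengths


def add_noise(columns, sigma, rng):
    """Corrupt every entry with independent Gaussian noise of standard deviation sigma."""
    return [[v + rng.gauss(0.0, sigma) for v in col] for col in columns]


rng = random.Random(0)
clean, lengths = make_columns(d=12, n=5, r=3, lognormal=True, rng=rng)
noisy = add_noise(clean, sigma=0.01, rng=rng)
print(lengths)
```

Before noise is added, the Euclidean norm of each column equals its drawn length, so the log-normal option produces the highly non-uniform column energies that the interactive sampler is designed to exploit.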

In Figure 2.3(c), we plot the relative error, which is the $\epsilon$ in Equation 2.3, as a function of the
average fraction of samples per column (averaged over columns, as we are using non-uniform
sampling) for 500 x 500 matrices of varying rank. In the next plot, Figure 2.3(d), we rescale the
relative error by $\sqrt{r}$, to capture the dependence on rank predicted by Theorem 2.5.

As we increase the number of observations, the relative error decreases quite rapidly. Moreover,
the algorithm needs more observations as the target rank $r$ increases. Qualitatively both of these
effects are predicted by Theorem 2.5. Lastly, the fact that the curves in Figure 2.3(d) nearly line
up suggests that the relative error $\epsilon$ does scale with $\sqrt{r}$.

In Figure 2.4(a), we plot the relative error as a function of the average fraction of samples, $p$,
per column for different matrix sizes. We rescale this data by plotting the y-axis in terms of $\sqrt{p}\,\epsilon$
(Figure 2.4(b)). From the first plot, we see that the error quickly decays, while a smaller fraction
of samples is needed for larger problems. In the second plot, we see that rescaling the error by
$\sqrt{p}$ has the effect of flattening out all of the curves, which suggests that the relationship between
$\epsilon$ and the number of samples is indeed $\epsilon\sqrt{p} \asymp 1$ or that $\epsilon \asymp \frac{1}{\sqrt{p}}$. This phenomenon is predicted
by Proposition 2.7.

Figure 2.4: (a): Relative error of Algorithm 3 as a function of sampling probability $p$ for different
size matrices with fixed target rank $r = 10$ and $\mu = 1$. (b): The same data where the y-axis
is instead $\sqrt{p}\,\epsilon$. (c): Relative error for interactive and non-interactive sampling on matrices
with uniform column lengths (column coherence $\mu = 1$ and column norms are uniform from
[0.9, 1.1]). (d): Relative error for interactive and non-interactive sampling on matrices with highly
nonuniform column lengths (column coherence $\mu = 1$ and column norms are from a standard
Log-Normal distribution).

In  the  last  set  of  simulations,  we  compare  our  algorithm  with  an  algorithm  that  first  performs 
uniform  sampling  and  then  hard  thresholds  the  singular  values  to  build  a  rank  r  approximation. 
In  Figure  2.4(c),  we  use  matrices  with  uniform  column  norms,  and  observe  that  both  algorithms 
perform  comparably.  However,  in  Figure  2.4(d),  when  the  column  norms  are  highly  non-uniform, 
we  see  that  Algorithm  3  dramatically  outperforms  the  passive  sampling  approach.  This  confirms 
our  claim  that  interactive  sampling  leads  to  better  approximation  when  the  energy  of  the  matrix 
is  not  uniformly  distributed. 

Finally, we compare Algorithm 3 with a non-interactive matrix approximation algorithm on two
real datasets. The non-interactive algorithm is the same hard thresholding algorithm used in
Figures 2.4(c) and 2.4(d). The first dataset is a 400-node subset of the King internet latency
dataset taken from Gummadi et al. [98]. The full dataset is much larger but has many missing
entries, so we used a 400 x 400 submatrix with minimal missing entries for our experiment. The
second dataset is a 1000 x 10,000 submatrix of the PubChem molecular similarity dataset [55].
We used target rank 26 for the King dataset and 25 for the PubChem dataset.

In Figure 2.5, we record the log excess risk as a function of the fraction of samples for both
interactive and non-interactive matrix approximation algorithms. The interactive algorithm
outperforms the non-interactive one on both datasets, but the improvement is more drastic for the
King dataset. Moreover, in absolute terms the improvement is larger in the low-sample regime: the
separation between the curves stays roughly constant on a logarithmic scale, so the absolute gap is
largest where the risk itself is largest. This demonstrates that interactive sampling is favorable for
these matrix approximation problems, and should be used in applications where it is feasible.

Figure 2.5: Experiments on real datasets. Left: Log excess risk for passive and interactive matrix
approximation algorithms on a 400-node subset of the King internet latency dataset with target
rank $r = 26$. Right: Log excess risk for passive and interactive matrix approximation algorithms
on a 1000 x 10,000 submatrix of the PubChem Molecular Similarity dataset with target rank
$r = 25$. The passive algorithm is based on uniform sampling followed by hard thresholding of the
singular values.


2.6  Conclusions 


This chapter considers the two related problems of low rank matrix completion and matrix
approximation. In both problems, we show how to use interactive sampling to overcome uniformity
assumptions that have pervaded the literature. Our algorithms focus measurements on interesting
columns (in the former, the columns that contain new directions and in the latter, the high
energy columns) and have performance guarantees that are significantly better than any known
passive algorithms in the absence of uniformity. Moreover, they are competitive with
state-of-the-art passive algorithms in the presence of uniformity. Our algorithms are conceptually simple,
easy to implement, and fairly scalable.

Turning  to  the  themes  of  this  thesis,  we  showed  how  interactive  sampling  enables  a  relaxation 
of  uniformity  requirements  for  these  completion  and  approximation  problems.  Specifically,  our 
algorithms  do  not  require  incoherence  assumptions  on  the  row  space  to  succeed,  and  we  showed 
that  all  non-interactive  procedures  do.  Our  algorithms  are  also  statistically  and  computationally 
efficient,  which  can  be  seen  both  theoretically  and  empirically  in  simulations.  Thus  we  believe 
that  these  completion  problems  make  a  compelling  case  for  interactive  learning. 
Chapter  3 


Interactive  Hierarchical  Clustering 


Clustering problems involve assigning objects to one or more groups, so that objects in the same
group are very similar while objects in different groups are dissimilar. In hierarchical
clusterings, the groups have multiple resolutions, so that a large cluster may be recursively divided
into smaller sub-clusters. These types of problems are ubiquitous; they are fundamental tools
in exploratory data analysis, data mining, and many scientific domains. There exist many
effective algorithms for clustering, but as data sets increase in size, the fact that these algorithms
require every pairwise similarity between objects poses a serious measurement and/or
computational burden and limits the scope for application. It is therefore practically appealing to develop
clustering algorithms that are effective on large scale problems but also have low measurement
and computational overhead.

To achieve low overhead, we focus on reducing the number of similarity measurements required
for clustering. This approach results in an immediate reduction in measurements in applications
where similarities are observed directly, but it can also provide dramatic computational gains
in applications where similarities between objects are computed via some kernel evaluated on
observed object features. The case of internet topology inference is an example of the former,
where covariance in the packet delays observed at nodes reflects the similarity between them.
Obtaining these similarities requires injecting probe packets into the network and places a
significant burden on network infrastructure. Phylogenetic inference and other biological sequence
analyses are examples of the latter, where computationally intensive edit distances are often used.
Note that both situations result in a low memory footprint as fewer pairwise similarities need to
be stored. In both cases our algorithms have dramatically lower overhead than many popular
algorithms.

In this chapter, we propose a novel approach to hierarchical clustering through interactivity, an
algorithmic paradigm where only a small number of informative similarities are measured. We
develop a meta-algorithm that iteratively applies a base clustering algorithm to small groups of
objects and that can be instantiated with any similarity-based clustering algorithm. This
meta-algorithm allows the user to specify a level of interactivity, and we provide theoretical analysis
that quantifies the resulting trade-off between measurement overhead and computation time on
one hand, and statistical accuracy on the other.

As an example, we apply our framework to spectral clustering. Spectral clustering is a popular
clustering technique that relies on the structure of the eigenvectors of the Laplacian of the
similarity matrix. These algorithms have received considerable attention in recent years because
of their empirical success, but they suffer from the fact that they require all $n(n-1)/2$
similarities and must compute a spectral decomposition of the $n \times n$ similarity matrix, which on
large datasets can be computationally prohibitive, in terms of both running time and space. Our
interactive algorithm avoids both of these limitations by subsampling few objects in each round
and only computing eigenvectors of very small sub-matrices. By appealing to previous
statistical guarantees [16], we can show that this algorithm has desirable theoretical properties, both in
terms of statistical and computational performance.

We  also  establish  several  necessary  conditions  in  the  noisy  constant  block  model,  under  which 
we  analyze  spectral  clustering.  We  give  lower  bounds  on  the  sample  complexity  for  interactive 
procedures  in  the  noiseless  case  and  also  a  lower  bound  on  non-interactive  procedures  for  noisy 
hierarchical  clustering.  Comparing  this  latter  lower  bound  with  the  analysis  for  our  interactive 
spectral  clustering  algorithm  concretely  demonstrates  that  interactivity  is  a  powerful  learning 
paradigm  for  hierarchical  clustering  from  pairwise  similarity  information. 

Our  detailed  contributions  are: 

1. We develop a principled method for converting a non-interactive non-hierarchical clustering
algorithm into an interactive hierarchical one, and we show how performance guarantees
on the subroutine translate into performance guarantees for hierarchical clustering
(Theorem 3.1). This technique can be thought of as a simple reduction: we reduce the
interactive hierarchical clustering problem to the non-interactive flat clustering problem.

2. As an example, we give a detailed statistical analysis of the interactive spectral clustering
algorithm derived by our reduction. In a model for similarity based clustering, we show
that this interactive spectral algorithm uses $O(n\,\mathrm{polylog}(n))$ pairwise similarities and runs
in $O(n\,\mathrm{polylog}(n))$ time, to obtain a hierarchical clustering on $n$ objects (Theorem 3.2).

3. We prove that any similarity based clustering algorithm must obtain $\Omega(n \log n / \log \log n)$
similarities, even in the absence of noise (Theorem 3.3). This lower bound certifies the
near-optimality of our approach.

4. We also show lower bounds against non-interactive approaches. In the same model used for
Theorem 3.2, we show that any non-interactive sampling strategy followed by any recovery
algorithm must use $\Omega(n^2)$ measurements to achieve the same statistical performance as
our interactive approach using $O(n \log^2 n)$ (Theorem 3.5). This certifies the power of
interactivity for this problem.

5.  We  complement  these  theoretical  results  with  detailed  empirical  evaluation. 

This  chapter  provides  support  for  our  thesis  on  all  fronts.  Our  interactive  clustering  algorithm 
is  statistically  more  powerful  than  non-interactive  approaches  in  the  sense  that  we  can  recover 
cluster  structure  with  far  fewer  measurements.  It  also  has  theoretically  and  empirically  faster 
running time as certified by Theorem 3.2 and our experimental evaluation. Lastly, we measure
uniformity by the size of the smallest cluster we hope to recover, where smaller clusters make
the problem less uniform, and our results are magnified in the presence of non-uniformity.


3.1  Related  Work 


There is a large body of work on hierarchical and partitional clustering algorithms, many coming
with various theoretical guarantees, but only a few algorithms attempt to minimize the number of
pairwise similarities used [  ,  7,  ]. Along this line, the work of Eriksson et al. [  ] and
Shamir and Tishby [154] is closest in flavor to ours.

Eriksson et al. [8  ] develop an interactive algorithm for hierarchical clustering and analyze the
correctness and measurement complexity under a noise model where a small fraction of the
similarities are inconsistent with the hierarchy. This bears resemblance to the persistent noise
model that we study in Chapter 4, although the learning task considered there is substantially
different. They show that for a constant fraction of inconsistent similarities, their algorithm can
recover hierarchical clusters up to size $\Omega(\log n)$ using $O(n \log^2 n)$ similarities. Our analysis for
ActiveSpectral yields similar results in terms of noise tolerance, measurement complexity,
and resolution, but in the context of i.i.d. subgaussian noise rather than inconsistencies. Our
algorithm is also computationally more efficient.

Another approach to minimizing the number of similarities used is via perturbation theory, which
suggests that randomly sampling the entries of a similarity matrix preserves properties such as
its spectral norm [1]. With this result, the Davis-Kahan theorem suggests that spectral clustering
algorithms, which look at the eigenvectors of the Laplacian associated with the similarity matrix,
can succeed in recovering the clusters. This intuition is formalized by Shamir and Tishby [  ],
who analyze a binary spectral algorithm that randomly samples $b$ entries from the similarity
matrix. They show an $\ell_2$ bound on the difference between the eigenvectors before and after
subsampling, but such a bound does not immediately translate into a strong exact recovery
guarantee. Indeed, to use this bound in the constant block model that we study here, one would need
$b = \Omega(n^2)$ measurements to obtain an exact recovery guarantee, which provides essentially no
improvement. Our work, translated to the flat clustering setting, is much stronger; Theorem 3.2
implies that $O(n \log n)$ similarities suffice to recover the clustering. Furthermore, we can
give guarantees on the size of the smallest cluster, $\Omega(\log n)$, that can be recovered in a hierarchy by
selectively sampling similarities at each level.

Recently Voevodski et al. [161  ] proposed an interactive algorithm for flat $k$-way clustering that
selects $O(k)$ landmarks and partitions the objects using distances to these landmarks.
Theoretically, the authors guarantee approximate recovery of clusters of size $\Omega(n)$ using $O(nk)$
pairwise distances. This idea of selecting landmarks bears strong resemblance to the first phase of
our interactive clustering algorithm and also has connections to the Landmark MDS algorithm
of de Silva and Tenenbaum [71]. These approaches are tied to specific algorithms, while our
framework is much more general. Moreover, we guarantee exact cluster recovery (under mild
assumptions) rather than approximate recovery, which translates into guarantees on hierarchical
clustering. This distinction is important because of the recursive nature of hierarchical clustering.

A related direction is the body of work on efficient streaming and online algorithms for
approximating the $k$-means and $k$-medians objectives (see, e.g., [  ,  ]). As with Voevodski et
al. [  L 68], the guarantees for these algorithms do not immediately translate into an exact recovery
guarantee, making it challenging to transform these approaches into hierarchical clustering
algorithms. Moreover, the success of spectral clustering in practice suggests that an efficient spectral
algorithm would also be very appealing. While there have been advances in this direction, the
majority of these require that the entire similarity matrix be known a priori [92]. Apart from [154],
we know of no other spectral algorithm that optimizes the number of similarities.

Another related line of research focuses on building data structures for fast nearest neighbor
computations of a point set. Many of these structures build hierarchical clusterings of the data points
so that traversing the tree to find the nearest neighbor of a data point can be done in logarithmic
time [28, 172]. Both the vantage point tree and the cover tree have the additional property that
only $O(n \log n)$ distances are used to create the hierarchical clustering, which translates to both
measurement and computational efficiency in our setting. The main differences are: (a) these
algorithms assume a metric space, (b) the algorithms do not partition the points at each level,
but rather create overlapping coverings, and (c) the algorithms insert points into the structure
iteratively in contrast with the recursive partitioning of our algorithm. Further, we are not aware
of any statistical analysis of these data structures for the hierarchical clustering problem.

There are also a few papers that consider alternative models of interaction for clustering
problems. Two types of interaction in the literature are supervision via must-link and cannot-link
constraints [27, 170], and via split or merge requests of an existing clustering [  15, 18]. In these
models, interactivity supplements the pairwise similarities that are available up front and enables
guarantees under weaker separation assumptions. In contrast, in our setting, the similarities are
not available up front and we employ interactivity to selectively obtain them. Consequently, our
setting is more challenging than even the fully observed case.

Lastly, our approach loosely falls into the framework of reductions for machine learning [  >2].
The broad theme of this work is to leverage existing algorithms to solve more complex learning
tasks, and existing results show how many prediction problems, including structured
prediction [69], contextual bandits [  1, 78], and multi-class classification [  26], can all be reduced to binary
classification. Our work shows how interactive hierarchical clustering can be reduced to
non-interactive non-hierarchical clustering, so that existing algorithms for the latter can immediately
be applied to the former.


3.2  Main  Results 


Algorithm 4 ActiveCluster($\mathcal{A}$, $s$, $\{x_i\}_{i=1}^n$, $k$)
    if $n \le s$ then return $\{x_i\}_{i=1}^n$
    Draw $S \subseteq \{x_i\}_{i=1}^n$ of size $s$ uniformly at random.
    $C'_1, \ldots, C'_k \leftarrow \mathcal{A}(S, k)$.
    Set $C_1 \leftarrow C'_1, \ldots, C_k \leftarrow C'_k$.
    for $x_i \in \{x_i\}_{i=1}^n \setminus S$ do
        $\forall j \in [k]$, $\alpha_j \leftarrow \frac{1}{|C'_j|} \sum_{x_l \in C'_j} K(x_i, x_l)$
        $C_{\arg\max_{j \in [k]} \alpha_j} \leftarrow C_{\arg\max_{j \in [k]} \alpha_j} \cup \{x_i\}$.
    end for
    output $\{C_j, \text{ActiveCluster}(\mathcal{A}, s, C_j, k)\}_{j=1}^k$

We first clarify some notation and introduce a hierarchical clustering model that we will analyze.
We refer to $\mathcal{A}$ as any flat clustering algorithm, which takes as parameters a dataset and a natural
number $k$, indicating the number of clusters to produce. Throughout the chapter, $k$ will denote
the  number  of  clusters  at  any  split,  and  we  will  assume  that  k  is  known  and  fixed  across  the 
hierarchy.  We  let  n  be  the  number  of  objects  in  the  dataset  and  define  s  to  be  a  parameter  to 
our  algorithm,  influencing  the  number  of  measurements  used  by  our  algorithm.  The  parameter 
s  reflects  a  tradeoff  between  the  measurement  overhead  and  the  accuracy;  increasing  s  increases 
the  robustness  of  our  method  at  the  cost  of  requiring  more  measurements.  Finally,  our  algorithms 
employ  an  abstract,  possibly  noisy  similarity  function  K,  which  can  model  both  cases  where 
similarities  are  measured  directly  and  where  they  are  computed  via  some  kernel  function  based 
on  observed  object  features. 

Definition 3.1. A $k$-way hierarchical clustering $\mathcal{C}$ on objects $\{x_i\}_{i=1}^n$ is a collection of clusters
such that $C_0 = \{x_i\}_{i=1}^n \in \mathcal{C}$ and for each $C_i, C_j \in \mathcal{C}$ either $C_i \subset C_j$, $C_j \subset C_i$, or $C_i \cap C_j = \emptyset$.
For any cluster $C$, if $\exists C'$ with $C' \subset C$, then there exists a set $\{C_i\}_{i=1}^k$ of disjoint clusters such
that $\bigcup_{i=1}^k C_i = C$.

Every hierarchical clustering $\mathcal{C}$ has a parameter $\eta$ that quantifies how balanced the clusters are
at any split. Formally, $\eta \ge \max_{\text{splits}} \frac{\max_i |C_i|}{\min_i |C_i|}$, where each split is a non-terminal cluster
partitioned into $\{C_i\}_{i=1}^k$. That is, $\eta$ upper bounds the ratio between the largest and smallest cluster sizes
across all splits in $\mathcal{C}$. This type of balancedness parameter has been used in previous analyses
of clustering algorithms [16,  :  ], and it is common to assume that the clustering is not too
unbalanced. For clarity of presentation, we will state our results assuming $\eta = O(1)$, although
our proofs contain a precise dependence between the sampling parameter $s$ and $\eta$.
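
As a small illustration of the balance parameter, the following Python sketch computes the tightest admissible $\eta$ for a hierarchy described only by the cluster sizes at each split. The function name `balance_factor` and the list-of-splits input format are assumptions made for this example and do not appear in the original text.

```python
def balance_factor(splits):
    """Smallest eta that upper bounds max|C_i| / min|C_i| over all splits.

    `splits` is a list of splits; each split lists the sizes of the
    sub-clusters that one non-terminal cluster is partitioned into.
    """
    return max(max(sizes) / min(sizes) for sizes in splits)


# Hypothetical binary hierarchy on 16 objects: the root splits 8/8,
# the left child splits 6/2, and the right child splits 4/4.
eta = balance_factor([[8, 8], [6, 2], [4, 4]])
```

Here the 6/2 split is the least balanced one, so any valid $\eta$ for this toy hierarchy must be at least 3.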


3.2.1  An  Interactive  Clustering  Framework 

Our  primary  contribution  is  the  introduction  of  a  novel  framework  for  hierarchical  clustering  that 
is  efficient  both  in  terms  of  the  number  of  similarities  used  and  the  algorithmic  running  time.  To 
recover  any  single  split  of  the  hierarchy,  we  run  a  flat  clustering  algorithm  A  on  a  small  subset  of 
the  data  to  compute  a  seed  clustering  of  the  dataset.  Using  this  initial  clustering,  we  place  each 
remaining object into the seed cluster to which it is most similar on average. This results in a
flat clustering of the entire dataset, using only similarities to the objects in the small subset.

Figure 3.1: Sampling pattern of Algorithm 4


By  recursively  applying  this  procedure  to  each  cluster,  we  obtain  a  hierarchical  clustering,  using 
a  small  fraction  of  the  similarities.  In  this  recursive  phase,  we  do  not  observe  any  measurements 
between  clusters  at  the  previous  split,  i.e.  to  partition  Cj,  we  only  observe  similarities  between 
objects  in  Cj.  This  results  in  an  interactive  algorithm  that  focuses  its  measurements  to  resolve 
the  higher-resolution  cluster  structure. 

Pseudocode for the meta-algorithm is shown in Algorithm 4. As a demonstration, in Figure 3.1,
we show the sampling pattern of Algorithm 4 on the first and second splits of a hierarchy, in
addition to the patterns at the end of the computation. Only the similarities shown in white are
needed. As is readily noticeable, the algorithm uses very few similarities but is still able to
recover this hierarchical clustering.
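
The recursive structure of the meta-algorithm can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: objects are assumed to be hashable labels, `K` is any two-argument similarity function, the subroutine is assumed to return at least two seed clusters, and the toy similarity `prefix_similarity`, the toy flat clusterer `pivot_split`, the helper `leaves`, and the empty-cluster guard are all hypothetical additions for illustration.

```python
import random


def active_cluster(flat_cluster, s, objects, k, K, rng):
    """Recursive sketch of Algorithm 4; returns a nested dict tree."""
    objects = list(objects)
    if len(objects) <= s:
        return {"cluster": objects, "children": []}
    seed = rng.sample(objects, s)             # subsample S of size s
    seed_clusters = flat_cluster(seed, k, K)  # C'_1, ..., C'_k
    if any(len(c) == 0 for c in seed_clusters):
        # Guard added for this sketch: stop on a degenerate seed clustering.
        return {"cluster": objects, "children": []}
    clusters = [list(c) for c in seed_clusters]
    in_seed = set(seed)
    for x in objects:
        if x in in_seed:
            continue
        # Average similarity of x to each seed cluster (alpha_j in Algorithm 4).
        alpha = [sum(K(x, y) for y in c) / len(c) for c in seed_clusters]
        best = max(range(len(alpha)), key=alpha.__getitem__)
        clusters[best].append(x)
    children = [active_cluster(flat_cluster, s, c, k, K, rng) for c in clusters]
    return {"cluster": objects, "children": children}


def prefix_similarity(i, j, bits=4):
    """Toy noiseless similarity: fraction of shared leading bits of two labels."""
    a, b = format(i, f"0{bits}b"), format(j, f"0{bits}b")
    shared = 0
    for u, v in zip(a, b):
        if u != v:
            break
        shared += 1
    return shared / bits


def pivot_split(seed, k, K):
    """Toy flat 2-way clusterer: split off the seeds least similar to seed[0]."""
    pivot, rest = seed[0], seed[1:]
    sims = [K(pivot, y) for y in rest]
    low = min(sims)
    near = [pivot] + [y for y, v in zip(rest, sims) if v > low]
    far = [y for y, v in zip(rest, sims) if v == low]
    return [near, far]


def leaves(tree):
    """Collect the terminal clusters of the returned tree."""
    if not tree["children"]:
        return [tree["cluster"]]
    return [leaf for child in tree["children"] for leaf in leaves(child)]


tree = active_cluster(pivot_split, 4, range(16), 2, prefix_similarity,
                      random.Random(0))
```

Note that similarities are only ever evaluated between an object and the seed objects of the cluster currently being split, which is the source of the measurement savings discussed in the text.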

Our  main  theoretical  contribution  is  a  characterization  of  Algorithm  4  in  terms  of  probability  of 
success  in  recovering  the  true  hierarchy  (denoted  C*),  measurement,  and  runtime  complexity.  To 
make  these  guarantees,  we  will  need  some  mild  restrictions  on  the  similarity  function  K,  which 
ensure  that  the  similarities  agree  with  the  hierarchy  (up  to  some  random  noise): 

K1 For each $x_i \in C_j \in \mathcal{C}^*$ and $j' \ne j$:

$$\min_{x_k \in C_j} \mathbb{E}[K(x_i, x_k)] - \max_{x_k \in C_{j'}} \mathbb{E}[K(x_i, x_k)] \ge \gamma > 0$$

where expectations are taken with respect to the possible randomness in $K$.

K2 For each object $x_i \in C_j$, and a set $M_j$ of objects of size $m_j$ drawn uniformly from $C_j \setminus \{x_i\}$,
we have:


P  min 


where $\sigma^2 \ge 0$ parameterizes the randomness in the similarity function $K$. Similarly, a set
$M_{j'}$ of size $m_{j'}$ drawn uniformly from cluster $C_{j'}$ with $j' \ne j$ satisfies:


<  2  exp 
K1 states that the similarity from an object $x_i$ to its cluster should, in expectation, be larger than
the similarity from that object to any other cluster. This is related to the Tight-Clustering
condition used by Eriksson et al. [8'  ] and less stringent than earlier results which assume that within-
and between-cluster similarities are constant and bounded in expectation [  ]. Moreover, an
assumption of this form seems necessary to ensure that one could identify the clustering with
access to a non-random similarity function $K$. K2 enforces that within- and between-cluster
similarities concentrate appropriately. This condition is satisfied, for example, if similarities
are constant in expectation, perturbed with independent subgaussian noise. We emphasize that
K2 subsumes many of the assumptions of previous clustering analyses (for example [16, 1  ]).
Moreover, if the similarity function is deterministic, then K2 is altogether unnecessary, and some
improvements to our algorithm are possible (see Proposition 3.4).

Our main result characterizes Algorithm 4 under assumptions K1 and K2:

Theorem 3.1. Let $\{x_i\}_{i=1}^n$ be a dataset with true hierarchical clustering $\mathcal{C}^*$, let $K$ be a similarity
function satisfying assumptions K1 and K2, and consider any flat clustering algorithm $\mathcal{A}$ with the
following property:

A1 For any dataset $\{y_i\}_{i=1}^m$ with clustering $\mathcal{C}$ where $K$ satisfies K1 and K2, $\mathcal{A}(\{y_i\}_{i=1}^m, k)$
recovers the first split of $\mathcal{C}$ with probability at least $1 - c_1 m k e^{-m}$ for some constant $c_1 > 0$.

Then Algorithm 4, on input $(\mathcal{A}, s, \{x_i\}_{i=1}^n, k)$:

R1 recovers all clusters of size at least $s$ with probability:

$$1 - c_0 n \exp\left(\frac{-s}{2(1+\eta)^2}\right) - c_1 n^2 \exp(-s) - c_\eta n k \log n \exp\left(\frac{-\gamma^2 s}{4\sigma^2(1+\eta)^2}\right) \tag{3.1}$$

for universal positive constants $c_0, c_1$ and for another constant $c_\eta$ that depends only on $\eta$.
This probability of success is $1 - o(1)$ as long as $s = \omega\left(\max\{1, \sigma^2/\gamma^2\} \log(nk)\right)$.

R2 uses $O(ns \log n)$ similarity measurements.

R3 runs in time $O(n A_s + ns \log n)$ where $\mathcal{A}$ on a dataset of size $s$ runs in time $O(A_s)$.


At  a  high  level,  the  theorem  says  that  the  clustering  guarantee  for  a  flat,  non-interactive  algorithm, 
A,  can  be  translated  into  a  hierarchical  clustering  guarantee  for  an  interactive  version  of  A,  and 
that  this  new  algorithm  enjoys  significantly  reduced  measurement  and  runtime  complexity.  The 
only  property  needed  by  A  is  that  it  recovers  a  flat  clustering  with  very  high  probability.  While 
the  probability  of  success  seems  strangely  high,  we  will  show  that  for  a  fairly  intuitive  model, 
a simple spectral clustering algorithm meets assumption A1. Verifying that the model satisfies
the conditions K1 and K2 immediately results in a guarantee for the interactive version of this
spectral algorithm.

Algorithm 5 SpectralCluster($W$)
    Compute Laplacian $L = D - W$, $D_{ii} = \sum_{j=1}^n W_{ij}$
    $v_2 \leftarrow$ smallest non-constant eigenvector of $L$.
    $C_1 \leftarrow \{i : v_2(i) \ge 0\}$, $C_2 \leftarrow \{j : v_2(j) < 0\}$
    output $\{C_1, C_2\}$.

We defer the proof of this theorem, and all theoretical results in this chapter, to Section 3.4.
However, before proceeding, some remarks are in order. First, by plugging in the lower bound
for $s$ into the upper bound on the measurement complexity, we see that Algorithm 4 needs
$O(n \log(nk) \log n)$ similarities, which is considerably less than the $O(n^2)$ similarities required
by a non-interactive algorithm. Second, at the lower bound for $s$, we see that unless $\mathcal{A}$ runs in
exponential time, Algorithm 4 runs in $\tilde{O}(n)$ time, which is polynomially faster than any clustering
algorithm that observes all of the similarities, as such an algorithm must take $\Omega(n^2)$ time.
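
To give a rough sense of scale for the first remark, the Python sketch below compares the order-level count $ns \log n$ from R2 against the $n(n-1)/2$ pairwise similarities of the fully observed matrix. All constants hidden by the $O(\cdot)$ notation are dropped, and the choice of $s$ on the order of $\log(nk)$ assumes that $\max\{1, \sigma^2/\gamma^2\}$ is a constant; both are simplifying assumptions for illustration only, and the function names are hypothetical.

```python
import math


def interactive_measurements(n, s):
    """Order-level count n * s * log(n) from R2, with constants dropped."""
    return n * s * math.log(n)


def all_pairs(n):
    """Number of distinct pairwise similarities among n objects."""
    return n * (n - 1) // 2


n, k = 10**6, 2
s = math.ceil(math.log(n * k))  # s on the order of log(nk), an assumption
ratio = interactive_measurements(n, s) / all_pairs(n)
```

For a million objects the interactive count is a small fraction of the full matrix under these assumptions, though the exact ratio depends on the constants that the sketch ignores.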


3.2.2  Interactive  Spectral  Clustering 

To make the guarantees in Theorem 3.1 more concrete, we show how to translate the result into a
real guarantee for a specific subroutine algorithm $\mathcal{A}$. We convert a simple spectral algorithm (see
pseudocode in Algorithm 5) into an interactive clustering algorithm, using the analysis from
Balakrishnan et al. [16]. The algorithm operates on hierarchically structured similarity matrices
referred to as noisy Constant Block Matrices (again from Balakrishnan et al. [16]).
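
A minimal NumPy sketch of the spectral subroutine in Algorithm 5 is given below. It interprets the "smallest non-constant eigenvector" as the eigenvector of the second-smallest Laplacian eigenvalue, which assumes the similarity graph is connected; the function name `spectral_cluster` and the toy block matrix are assumptions for this example.

```python
import numpy as np


def spectral_cluster(W):
    """Two-way split by the sign pattern of the second Laplacian eigenvector."""
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))        # degree matrix, D_ii = sum_j W_ij
    L = D - W                         # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)       # eigenvalues returned in ascending order
    v2 = vecs[:, 1]                   # second-smallest eigenvalue's eigenvector
    c1 = [i for i in range(len(W)) if v2[i] >= 0]
    c2 = [i for i in range(len(W)) if v2[i] < 0]
    return c1, c2


# Toy noiseless similarity matrix with two blocks of three objects each:
# within-block similarity 0.9, between-block similarity 0.1.
W = np.full((6, 6), 0.1)
W[:3, :3] = 0.9
W[3:, 3:] = 0.9
c1, c2 = spectral_cluster(W)
```

Because the sign of an eigenvector is arbitrary, which block ends up as `c1` is not determined; only the partition itself is meaningful.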

We study the special case of binary hierarchical clustering, where each non-terminal cluster is
partitioned into exactly two groups. As a naming convention, we identify a cluster by a string $\xi$
of L and R symbols. The two sub-clusters of a non-terminal cluster $C_\xi$ are $C_{\xi \circ L}$ and $C_{\xi \circ R}$.

The  noisy  Constant  Block  Model  is  defined  using  this  terminology  as  follows: 

Definition 3.2. A similarity matrix $W$ is a noisy constant block matrix (noisy CBM) if $W =
A + R$ where $A$ is ideal and $R$ is a perturbation matrix:

•  An ideal similarity matrix is characterized by off-block diagonal similarity values $\beta_\xi \in
[0, 1]$ for each cluster $C_\xi$ such that if $x \in C_{\xi \circ L}$ and $y \in C_{\xi \circ R}$, where $C_{\xi \circ L}$ and $C_{\xi \circ R}$ are
two sub-clusters of $C_\xi$ at the next level in a binary hierarchy, then $A_{x,y} = \beta_\xi$. Additionally,
$\min\{\beta_{\xi \circ R}, \beta_{\xi \circ L}\} \ge \beta_\xi$. Define $\gamma = \min\{\min_\xi\{\min\{\beta_{\xi \circ R}, \beta_{\xi \circ L}\} - \beta_\xi\}, \beta_0\}$, where $\beta_0$ is
the minimum overall similarity.

•  A symmetric $(n \times n)$ matrix $R$ is a perturbation matrix with parameter $\sigma$ if (a) $\mathbb{E}(R_{ij}) =
0$, (b) the entries of $R$ are subgaussian, that is $\mathbb{E}(\exp(t R_{ij})) \le \exp\left(\frac{\sigma^2 t^2}{2}\right)$, and (c) for each
row $i$, $R_{i1}, \ldots, R_{in}$ are independent.
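
The definition can be made concrete with a short NumPy sketch that builds a noisy CBM for a balanced binary hierarchy. Several simplifications here are assumptions for illustration rather than part of the definition: the similarity $\beta_\xi$ depends only on the depth of $C_\xi$ (given as the list `betas`, with the last entry used inside the smallest clusters), $n$ is divisible by the number of smallest clusters, the noise is Gaussian (one instance of subgaussian noise), and the diagonal is left unperturbed. The function names are hypothetical.

```python
import numpy as np


def ideal_cbm(n, betas):
    """Ideal matrix A for a balanced binary hierarchy with depth-wise betas."""
    A = np.empty((n, n))

    def fill(lo, hi, depth):
        if depth == len(betas) - 1:
            A[lo:hi, lo:hi] = betas[depth]   # inside a smallest cluster
            return
        mid = (lo + hi) // 2
        A[lo:mid, mid:hi] = betas[depth]     # off-diagonal blocks of this split
        A[mid:hi, lo:mid] = betas[depth]
        fill(lo, mid, depth + 1)
        fill(mid, hi, depth + 1)

    fill(0, n, 0)
    return A


def noisy_cbm(n, betas, sigma, seed=0):
    """W = A + R with a symmetric Gaussian perturbation R of parameter sigma."""
    rng = np.random.default_rng(seed)
    upper = np.triu(rng.normal(0.0, sigma, size=(n, n)), k=1)
    return ideal_cbm(n, betas) + upper + upper.T


def gamma(betas):
    """gamma = min over levels of (child beta - parent beta), capped by beta_0."""
    gaps = [child - parent for parent, child in zip(betas, betas[1:])]
    return min(min(gaps), betas[0])


betas = [0.1, 0.4, 0.9]          # nondecreasing with depth, as required
A = ideal_cbm(8, betas)
W = noisy_cbm(8, betas, sigma=0.05)
```

In this toy instance the smallest gap between consecutive levels is 0.3, but the minimum overall similarity 0.1 is smaller, so it determines $\gamma$.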

To apply Theorem 3.1, we need to verify that assumptions K1 and K2 are met and that Algorithm 5
succeeds with exponentially high probability. Checking that these conditions hold provided the
signal-to-noise ratio is large enough results in the following guarantees for ActiveSpectral,
the interactive version of Algorithm 5. The proof of this theorem is deferred to Section 3.4.

Theorem 3.2. Let $W$ be a noisy CBM with  and $\eta = O(1)$, and with $n > n_0$, the latter of
which is a universal constant. Then for any $m \ge s$, ActiveSpectral succeeds in recovering
all clusters of size $m$ with probability $1 - o(1)$ as long as $s = \omega\left(\max\{1, \ldots\} \log(n)\right)$.

ActiveSpectral uses $O(ns \log n)$ measurements and runs in $O(ns^2 \log s + ns \log n)$ time.

This theorem quantifies the tradeoff between statistical robustness and measurement complexity
for the hierarchical spectral algorithm. On one end, if $\gamma^2/\sigma = O(1)$, then ActiveSpectral can
successfully recover clusters of size $\log n$ while using $O(n \log^2 n)$ measurements. At the other
end of this spectrum, if $s = \Theta(n)$, then we can tolerate $\sigma \asymp \sqrt{\ldots}$, but can only recover
clusters of size $\Theta(n)$. This is essentially the same as the result of Balakrishnan et al. [16], who
show that by using $O(n^2)$ measurements, one can tolerate noise that grows fairly rapidly with $n$.
Varying $s$ allows for interpolation between these two extremes.

Several  remarks  are  in  order: 

1. First, note that the condition that $s$ must grow faster than $\log(n)$ implies that the smallest
clusters of the hierarchy cannot be recovered. These clusters are irrecoverably buried in noise,
so one should not expect that recovery is possible.

2. The condition that $\gamma^2/\sigma = O(1)$ is undesirable for several reasons. Since the similarities
are bounded between zero and one, $\gamma^2 \le \gamma$, so this condition is more stringent than
requiring that $\gamma/\sigma = O(1)$, which is a more natural measure for the signal-to-noise ratio.
Secondly, if the minimum cluster size remains fixed as $n$ grows, $\gamma$ must decrease, which
implies that we require $\sigma \to 0$ for consistent recovery.

3. On the other hand, if the depth of the hierarchy remains fixed as $n$ increases, then $\gamma$ can
remain constant, so that it suffices to have $\sigma = O(1)$ for exact recovery. Unfortunately, for
this to happen, the minimum cluster size must scale linearly with $n$, although in this case
one can still aggressively subsample the matrix to recover all of these clusters.

The proof of this theorem is not quite a direct application of Theorem 3.1. Instead, we show that
Algorithm 5 meets assumption A1 modulo a term that depends on $\gamma$ and $\sigma$, and then we plug this
into Equation 3.1 (replacing the second exponential term). Solving for $s$ in the updated version of
Equation 3.1 proves the theorem. If we instead directly applied Theorem 3.1, we would require
the SNR to be $\Omega(1)$ and arrive at one end of the tradeoff between robustness and measurement
complexity. Our approach allows one to see how varying $s$ affects the tolerable SNR.


3.2.3  Active k-means Clustering

It is also possible to insert Lloyd's algorithm for $k$-means clustering into our framework, but we
cannot prove statistical performance guarantees since it is unknown whether Lloyd's algorithm
satisfies assumption A1 for any meaningful model. $k$-means helps illuminate the differences
between observing similarities directly and computing similarities from observed object features.
Conventionally, $k$-means fits into the latter framework. Here, the interactive version does not
enjoy a reduced measurement complexity, because all objects must be observed, but it can lead
to running time improvements as fewer distance/kernel evaluations are required.
A less traditional way to use $k$-means is to represent each object as an $n$-dimensional vector of
its similarity to each other object. Here, we can apply $k$-means to an $n \times n$ similarity matrix,
much like we can apply Spectral Clustering, and this algorithm can be made interactive using
our framework. While we cannot develop theoretical guarantees for this algorithm, which we
call ActiveKMeans, our experiments demonstrate that it performs very well in practice.


3.2.4  Fundamental  Limits 

Universal  Limits:  We  now  turn  to  lower  bounding  the  number  of  similarities  needed  to  recover  a 
hierarchical  clustering.  We  first  give  a  necessary  condition  on  the  number  of  similarities  needed 
by  any  algorithm  in  the  absence  of  noise.  Note  that  this  result  applies  to  both  interactive  and 
non-interactive  algorithms.  We  prove  the  following  theorem: 

Theorem 3.3. Consider a noiseless hierarchical clustering problem on $n$ objects with minimum
cluster size $m$. For any algorithm $\mathcal{A}$, if $\mathcal{A}$ guarantees exact recovery of the true hierarchy, it must
be the case that $\mathcal{A}$ has measurement complexity

$$\Omega\left(\frac{n \log_2(n/m)}{\log_2\left[\log_2(n/m) + 1\right]}\right).$$

This theorem asserts that $\Omega(n \log n / \log \log n)$ measurements are necessary to recover all
constant sized clusters, even in the absence of noise. The proof uses two main ideas. First, we use
a combinatorial argument to count the total number of hierarchical clusterings on $n$ objects with
minimum cluster size $m$. Then, we use an adversarial construction, whereby a learner attempts
to identify the clustering while an adversary attempts to hide it. In similar spirit to version-space
algorithms, we show that for any query made by the algorithm, the adversary can provide a
response that does not significantly reduce the number of consistent clusterings. Combining this
with the counting argument gives a necessary condition on the number of queries any algorithm
must make.
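
For intuition about the size of this lower bound, the following Python sketch evaluates the expression $n \log_2(n/m) / \log_2[\log_2(n/m) + 1]$ numerically. It assumes the bound has exactly this form with the constant hidden by $\Omega(\cdot)$ set to one, and it requires $n > m$; the function name `lower_bound` is hypothetical.

```python
import math


def lower_bound(n, m):
    """Evaluate n*log2(n/m) / log2(log2(n/m) + 1), assuming n > m."""
    depth = math.log2(n / m)       # number of levels in a balanced binary tree
    return n * depth / math.log2(depth + 1)


bound = lower_bound(1024, 1)       # all clusters down to single objects
upper = 1024 * math.log2(1024)     # n log2(n/m), for comparison
```

The two quantities differ only by the doubly logarithmic factor in the denominator, which is why the lower bound is described as nearly matching $n \log n$.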

Notice that Theorem 3.1 shows that Algorithm 4 uses $O(n \log^2 n)$ similarities to recover clusters
up to size $\log n$. This difference in measurement complexity in comparison with the necessary
condition in Theorem 3.3 is due to the fact that Algorithm 4 was designed to be robust to noise,
and to get closer to the fundamental limit, one must build a more brittle algorithm. Specifically,
one can nearly achieve the limit in Theorem 3.3 by a simple algorithm that samples one point,
thresholds the similarities between all of the objects and that point to form two clusters, and then
recursively partitions. We show rigorously that this algorithm will recover all of the clusters and
that it uses $O(n \log n)$ similarities, which we summarize in the following:

Proposition 3.4. Let $\mathcal{C}^*$ be a 2-way hierarchy with balance factor $\eta = 1$ where $K$ satisfies
K1 and K2 with $\sigma = 0$. Then there exists an algorithm that uses $n \log(n/m)$ similarities and
deterministically recovers all clusters of size at least $m$ in $\mathcal{C}^*$.
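
One way to realize the pivot-and-threshold procedure described before the proposition is sketched below in Python. This is an illustrative reading, not the construction used in the proof: it takes the first object of each cluster as the sampled point, and it implements the threshold by using the balance factor of one, so that the half of the objects least similar to the pivot forms the other sub-cluster. The toy similarity `prefix_similarity` and the function name `threshold_hierarchy` are assumptions for the example.

```python
def threshold_hierarchy(objects, K, m):
    """Return (clusters, similarities_used) for a noiseless balanced hierarchy."""
    objects = list(objects)
    if len(objects) < 2 * m:
        return [objects], 0                  # cannot split into clusters of size m
    pivot, rest = objects[0], objects[1:]
    sims = {y: K(pivot, y) for y in rest}    # len(objects) - 1 queries
    rest.sort(key=sims.get, reverse=True)    # most similar to the pivot first
    half = len(objects) // 2
    near, far = [pivot] + rest[:half - 1], rest[half - 1:]
    left, used_left = threshold_hierarchy(near, K, m)
    right, used_right = threshold_hierarchy(far, K, m)
    return [objects] + left + right, len(rest) + used_left + used_right


def prefix_similarity(i, j, bits=4):
    """Toy noiseless similarity: fraction of shared leading bits of two labels."""
    a, b = format(i, f"0{bits}b"), format(j, f"0{bits}b")
    shared = 0
    for u, v in zip(a, b):
        if u != v:
            break
        shared += 1
    return shared / bits


clusters, used = threshold_hierarchy(range(16), prefix_similarity, m=2)
```

On this toy hierarchy with $n = 16$ and $m = 2$ the sketch queries 41 similarities, below the $n \log_2(n/m) = 48$ allowed by the proposition.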

This shows that roughly $n \log n$ similarities are necessary and sufficient to recover hierarchical
clusterings on $n$ objects. Interestingly, if $m$ is large, then far fewer similarities are necessary, and
as we will see, non-interactive algorithms come close to achieving these fundamental limits. If
$m = n/2$, or just one partition is required, the algorithm used to prove Proposition 3.4 is actually
non-interactive, and it meets the necessary condition established in Theorem 3.3, which applies
to both interactive and non-interactive algorithms. On the other hand, for $m \ll n$, interactivity
appears to be necessary for achieving low measurement complexity.

Limits  on  non-interactive  algorithms:  In  order  to  demonstrate  performance  gains  from  the 
interactive  sampling  model,  it  is  important  to  establish  necessary  conditions  on  non-interactive 
procedures.  This  is  precisely  the  content  of  our  next  result. 

To state the theorem precisely, we need some new definitions. We define a class of models
$\mathcal{H}(n, m, \gamma)$ as the set of all hierarchical clusterings on $n$ objects where the minimum cluster size
is $m$ and the difference between within- and between-cluster similarities is $\gamma$ at every level of the
hierarchy. Thus, every model $\mathcal{C} \in \mathcal{H}(n, m, \gamma)$ corresponds to an $n \times n$ similarity matrix $M[\mathcal{C}]$.
In the non-interactive setting, we are given a sensing budget $\tau$ and are allowed to distribute this
sensing budget across the coordinates of the similarity matrix. A sensing strategy is a matrix
$B \in \mathbb{R}_+^{n \times n}$ such that $\sum_{i,j} B_{ij} \le \tau$. Given this, our observation is the matrix:

$$A_{ij} = M_{ij}[\mathcal{C}] + B_{ij}^{-1/2} \mathcal{N}(0, 1) = \mathcal{N}\left(M_{ij}[\mathcal{C}], B_{ij}^{-1}\right)$$


Note that this setup is a generalization of non-interactive sampling for hierarchical clustering
(with $\sigma^2 = 1$) considered earlier, as one can obtain a non-integral number of samples per
similarity, rather than just a single sample for a subset of the similarities. A typical sampling approach
for hierarchical clustering has $B_{ij} \in \{0, 1\}$ for all $i, j$, with $\sum_{i,j} B_{ij}$ as the measurement budget.
Our setup strictly generalizes this class of sampling strategies.
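
The sensing model can be simulated with a few lines of NumPy. The sketch below assumes the observation has the form $A_{ij} = M_{ij} + B_{ij}^{-1/2} z_{ij}$ with independent standard normal $z_{ij}$, so that more budget on a coordinate means less noise there; entries with zero budget are treated as unobserved and returned as NaN, which is a convention chosen for this example. The function name `observe` is hypothetical.

```python
import numpy as np


def observe(M, B, rng):
    """Draw A_ij ~ N(M_ij, 1 / B_ij); coordinates with B_ij = 0 become NaN."""
    M = np.asarray(M, dtype=float)
    B = np.asarray(B, dtype=float)
    scale = np.full(B.shape, np.nan)
    # Noise standard deviation B_ij^{-1/2}, computed only where budget is spent.
    np.divide(1.0, np.sqrt(B), out=scale, where=B > 0)
    return M + scale * rng.standard_normal(M.shape)


rng = np.random.default_rng(0)
M = np.zeros((3, 3))

# A {0, 1}-valued strategy: one sample for each off-diagonal similarity.
B_binary = np.ones((3, 3)) - np.eye(3)
A_binary = observe(M, B_binary, rng)

# A strategy with a very large budget per coordinate gives nearly exact values.
A_dense = observe(M, np.full((3, 3), 1e12), rng)
```

The binary strategy corresponds to the typical sampling approach mentioned above, with total budget equal to the number of sampled coordinates.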

Given such an observation, the goal of a recovery algorithm $T$ will be to identify the model $\mathcal{C}$
with low probability of error. Specifically we are interested in lower bounding the minimax risk:

$$R(n, m, \gamma, \tau) = \inf_{B : \|B\|_1 \le \tau,\ T} \ \sup_{\mathcal{C} \in \mathcal{H}(n, m, \gamma)} \mathbb{P}[T(A) \ne \mathcal{C}]$$

Notice  that  this  probability  of  error  is  exactly  the  same  as  the  probability  that  the  algorithm  T  fails 
to  recover  all  clusters  of  size  m.  This  is  a  special  case  of  the  structured  normal  means  problem 
studied  in  Chapter  5.  A  consequence  of  Theorem  5.5  and  Proposition  5.6  is  the  following: 
Theorem 3.5. If $n, m$ are both powers of two, then the minimax risk, when observing the entire
matrix, is bounded from below by 1/2 when $\gamma < \ldots$. Under budget constraint $\tau$, the
minimax risk is bounded from below by 1/2 when:

$$\gamma < \sqrt{\frac{\binom{n}{2}}{\tau(8m - 4)} \log(nm/6)}$$


The first part of this theorem, where the entire matrix is observed, was established by Balakrishnan
et al. [16], although our proof technique is more general. The second part of the theorem
is entirely new, and it lower bounds the performance of any passive sampling strategy followed
by any recovery algorithm for the hierarchical clustering problem. To compare with our interactive
approach and the guarantee in Theorem 3.2, we set $\gamma = \Theta(1)$ and solve for $\tau$, arriving at
$\tau \asymp \frac{n^2 \log n}{m}$. We compare this bound to our interactive approach.

To do this comparison, note that since we are in a constant SNR regime, we can simply apply
Theorem 3.2. This result says that the interactive procedure uses $O(ns \log n)$ measurements
and recovers all clusters of size $m \ge s$, provided that $s = \omega(\log n)$. If we set $s = \Theta(\log^2 n)$,
we see that we can recover clusters of size $m$ with sensing budget independent of $m$, i.e. with
sensing budget $\Theta(n\,\mathrm{polylog}(n))$, provided that $m = \Omega(\log^2 n)$. Non-interactive procedures can
achieve nearly linear sensing budget only if the smallest cluster sizes are also nearly linear.
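
To get a feel for the size of this gap, the following sketch compares the two budget scalings
numerically. It assumes the passive requirement scales as $n^2 \log(n)/m$ and the interactive
budget as $n s \log n$ with $s = \log^2 n$. All constants are dropped, so only the trend in $m$ is
meaningful, and the function names are illustrative.

```python
import math

def passive_budget(n, m):
    """Order of the passive budget requirement, n^2 log(n) / m (constants dropped)."""
    return n ** 2 * math.log(n) / m

def interactive_budget(n, s=None):
    """Order of the interactive budget, n * s * log(n), with s = log^2(n) by default."""
    if s is None:
        s = math.log(n) ** 2
    return n * s * math.log(n)

if __name__ == "__main__":
    n = 2 ** 16
    for m in (2 ** 4, 2 ** 8, 2 ** 12, 2 ** 15):
        ratio = passive_budget(n, m) / interactive_budget(n)
        print(f"m = {m:6d}   passive / interactive = {ratio:10.3f}")
```

Under these assumptions the ratio is $n / (m \log^2 n)$, so the advantage of the interactive budget
shrinks as the smallest cluster size $m$ approaches $n$.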

This shows a significant performance gain due to interactive sampling, and this gain is most striking
when recovering big clusters at the top of the hierarchy and small clusters towards the bottom.
This is in line with the main theme of our thesis, that interactive procedures are particularly
powerful in the presence of non-uniformity, which in this case relates to the cluster sizes.


3.3  Experimental  Results 


In this section we describe our empirical evaluation of the interactive clustering approaches
described in Section 3.2. We start with several practical considerations.


3.3.1  Practical  Considerations 

ActiveSpectral as stated has some shortcomings that enable theoretical analysis but that are
undesirable for practical applications. Specifically, the fact that $k$ is known and constant across
splits in the hierarchy and the balancedness condition are both assumptions that are likely to be
violated in any real-world setting. We therefore develop a variant of ActiveSpectral, called
HeurSpec, with several heuristics.

First, we employ the popular eigengap heuristic, in which the number of clusters $k$ is chosen so
that the gap in eigenvalues $\lambda_{k+1} - \lambda_k$ of the Laplacian is large. Secondly, we propose discarding
all subsampled objects with low degree (when restricted to the sample) in the hopes of removing
underrepresented  clusters  from  the  sample.  In  the  averaging  phase,  if  an  object  is  not  highly 
similar  to  any  cluster  represented  in  the  sample,  we  create  a  new  cluster  for  this  object.  We 
expect  that  in  tandem,  these  two  heuristics  will  help  us  recover  small  clusters.  By  comparing  the 
performance  of  HeurSpec  to  that  of  ActiveSpectral,  we  indirectly  evaluate  these  heuristics. 
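
The eigengap heuristic can be sketched as follows. This is a minimal illustration and not the
HeurSpec implementation: it assumes the unnormalized Laplacian $L = D - W$ and picks the $k$ that
maximizes $\lambda_{k+1} - \lambda_k$ among the smallest eigenvalues. The function name and the
`k_max` cap are hypothetical.

```python
import numpy as np

def eigengap_num_clusters(similarity, k_max=10):
    """Choose k maximizing the gap lambda_{k+1} - lambda_k of the graph Laplacian."""
    W = np.asarray(similarity, dtype=float)
    W = (W + W.T) / 2.0                      # symmetrize defensively
    L = np.diag(W.sum(axis=1)) - W           # unnormalized Laplacian (an assumption)
    eigvals = np.sort(np.linalg.eigvalsh(L))
    k_max = min(k_max, len(eigvals) - 1)
    # gaps[k - 1] holds lambda_{k+1} - lambda_k for k = 1, ..., k_max
    gaps = eigvals[1:k_max + 1] - eigvals[:k_max]
    return int(np.argmax(gaps)) + 1

if __name__ == "__main__":
    # Three noiseless blocks of five objects each: within similarity 1, between 0.
    W = np.kron(np.eye(3), np.ones((5, 5)))
    print(eigengap_num_clusters(W))
```

On the noiseless three-block example the Laplacian has three zero eigenvalues followed by a
jump, so the largest gap sits at $k = 3$.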
Figure 3.2: Simulation experiments. Top row: Noise thresholds for Algorithm 5, k-means clustering,
ActiveSpectral, and ActiveKMeans with $s = \log^2(n)$ for interactive algorithms.
Bottom row from left to right: probability of success as a function of $s$ for $n = 256$, $\sigma = 0.75$,
outlier fractions on noisy CBM, probing complexity, and runtime complexity.

3.3.2  Simulations 

In this section we present some empirical results on synthetic data. By Theorem 3.2, we expect
ActiveSpectral to be robust to a constant amount of noise $\sigma$, meaning that it will recover all
sufficiently large splits with high probability. In comparison, Balakrishnan et al. [16] show that
spectral clustering can tolerate noise growing with $n$. We contrast these guarantees by plotting
the probability of successful recovery of the first split in a noisy CBM as a function of $\sigma$ for
different $n$ in Figure 3.2. Figure 3.2(a) demonstrates that indeed the noise tolerance of spectral
clustering grows with $n$ while Figure 3.2(c) demonstrates that ActiveSpectral enjoys constant noise
tolerance. Figures 3.2(b) and 3.2(d) suggest that similar guarantees may hold for k-means and
ActiveKMeans.

Our theory also predicts that increasing the sampling parameter improves the performance of
ActiveSpectral. To demonstrate this, we plot the probability of successful recovery of the
first split of a noisy CBM of size $n = 256$ as a function of $s$ for fixed noise variance. We compare
three algorithms, ActiveSpectral, ActiveKMeans, and Algorithm 1 of Shamir and Tishby
[ ], which subsamples entries of the similarity matrix. In theory, ActiveSpectral requires
$\Omega(n \log n)$ total measurements to recover a single split, whereas Shamir and Tishby [15 ] show
that their algorithm requires $\Omega(n \log^{3/2} n)$ (recall that this does not immediately translate into
a clustering guarantee). Figure 3.2(e) demonstrates that this improvement is also noticeable in
practice.

The simulations in Figures 3.2(a)-(e) only examine the ability of our algorithms to recover the
first split of a hierarchy, while our theory predicts that all sufficiently large clusters can be reliably
recovered. One way to measure this is the outlier fraction metric between the clustering returned
by an algorithm and the true hierarchy [87]. For any triplet of objects $x_i, x_j, x_k$, we say that the
two clusterings agree on this triplet if they both group the same pair of objects deeper in the
hierarchy relative to the third object and disagree otherwise. The outlier fraction is simply the
fraction of triplets on which the two clusterings disagree.
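
The triplet comparison described above can be sketched as follows. The representation is an
assumption made for illustration: each hierarchy is a nested tuple whose leaves are object
labels, and the depth at which a pair is grouped is the depth of its lowest common ancestor. The
function returns the fraction of triplets on which the two hierarchies agree. The fraction on
which they disagree is one minus this value.

```python
from itertools import combinations

def leaf_paths(tree, prefix=()):
    """Map each leaf label to its root-to-leaf path of child indices."""
    if not isinstance(tree, (tuple, list)):
        return {tree: prefix}
    paths = {}
    for idx, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (idx,)))
    return paths

def lca_depth(p, q):
    """Depth of the lowest common ancestor, i.e. length of the shared path prefix."""
    depth = 0
    for a, b in zip(p, q):
        if a != b:
            break
        depth += 1
    return depth

def deepest_pair(paths, triplet):
    """Return the pair grouped strictly deepest in the triplet, or None on a tie."""
    depths = {pair: lca_depth(paths[pair[0]], paths[pair[1]])
              for pair in combinations(triplet, 2)}
    best = max(depths.values())
    winners = [pair for pair, d in depths.items() if d == best]
    return winners[0] if len(winners) == 1 else None

def triplet_agreement(tree_a, tree_b):
    """Fraction of triplets on which both hierarchies single out the same pair."""
    pa, pb = leaf_paths(tree_a), leaf_paths(tree_b)
    triplets = list(combinations(sorted(pa), 3))
    agree = sum(deepest_pair(pa, t) == deepest_pair(pb, t) for t in triplets)
    return agree / len(triplets)

if __name__ == "__main__":
    truth = ((0, 1), (2, 3))
    estimate = (((0, 1), 2), 3)
    print(triplet_agreement(truth, estimate))
```

Both hierarchies must contain the same leaf labels. Ties (for example, three objects split in a
single three-way node) are treated as "no deepest pair" in this sketch.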

In Figure 3.2(f), we plot the outlier fraction for six algorithms as a function of $\sigma$ on the noisy
HBM. The algorithms are: Hierarchical Spectral (SC), Single Linkage (SL), HeurSpec (HSC),
ActiveSpectral (ASC), Hierarchical k-Means (KM), and ActiveKMeans (AKM). These
experiments demonstrate that the non-interactive algorithms (except single linkage) are much
more robust to noise than the corresponding interactive ones, as predicted by our theory, but also
that the heuristics described in Section 3.3.1 have a dramatic impact on performance.

Lastly, we verify the measurement and running time complexity guarantees for our interactive
algorithms in comparison to the non-interactive versions. In Figures 3.2(g) and 3.2(h), we plot
the number of measurements and running time as a function of $n$ on a log-log plot for each
algorithm. The three non-interactive algorithms have steeper slopes than the interactive ones,
suggesting that they are polynomially more expensive in both cases.
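
Reading a polynomial exponent off a log-log plot amounts to fitting a line to the logged data.
The sketch below illustrates this on synthetic cost curves. The two curves ($n^2$ and
$n \log^3 n$) are stand-ins chosen for illustration and are not the measured values behind
Figure 3.2.

```python
import numpy as np

def loglog_slope(ns, costs):
    """Least-squares slope of log(cost) against log(n)."""
    slope, _intercept = np.polyfit(np.log(ns), np.log(costs), 1)
    return float(slope)

if __name__ == "__main__":
    ns = 2.0 ** np.arange(6, 13)
    quadratic = ns ** 2                  # stand-in for a non-interactive cost
    near_linear = ns * np.log(ns) ** 3   # stand-in for an interactive cost
    print("quadratic slope:  ", loglog_slope(ns, quadratic))
    print("near-linear slope:", loglog_slope(ns, near_linear))
```

A slope near 2 indicates quadratic growth. A near-linear cost with polylogarithmic factors yields
a slope between 1 and 2 over a finite range of $n$, approaching 1 only slowly as $n$ grows.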


3.3.3  Real  World  Experiments 

To demonstrate the practical performance of our framework, we apply our algorithms to three
real-world datasets and one additional synthetic dataset. The datasets are: the set of articles
from NIPS volumes 0 through 12 [148], a subset of NPIC500 co-occurrence data from the
Read-the-Web project [131 ] which we call RTW, a SNP dataset from the HGDP [138], and a synthetic
phylogeny dataset produced using phyclust [ ].

The  NIPS  dataset  consists  of  1740  machine  learning  research  articles  from  Neural  Information 
Processing  Systems  Volumes  0-12.  Each  article  was  converted  into  a  TF-IDF  vector  and  pairs  of 
vectors  were  compared  using  cosine  similarity. 
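
A minimal sketch of this preprocessing step is given below. The weighting used here (raw term
count times $\log(N / \mathrm{df})$) and the whitespace tokenizer are assumptions, since the text does
not specify which TF-IDF variant or tokenization was used. The function names are illustrative.

```python
import math
from collections import Counter

import numpy as np

def tfidf_matrix(docs):
    """Rows are documents, columns are vocabulary terms, entries are tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    index = {w: j for j, w in enumerate(vocab)}
    doc_freq = Counter(w for doc in tokenized for w in set(doc))
    X = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(tokenized):
        for w, count in Counter(doc).items():
            X[i, index[w]] = count * math.log(len(docs) / doc_freq[w])
    return X

def cosine_similarity_matrix(X):
    """Pairwise cosine similarities between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0              # leave all-zero rows at similarity 0
    Xn = X / norms
    return Xn @ Xn.T

if __name__ == "__main__":
    docs = ["spectral clustering graph",
            "spectral clustering laplacian",
            "neural network training"]
    print(np.round(cosine_similarity_matrix(tfidf_matrix(docs)), 3))
```

An interactive algorithm would evaluate only the requested entries of this matrix instead of
forming all pairs, which is where the measurement savings come from.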

The  RTW  data  is  a  subsampled  version  of  the  NPIC500  co-occurrence  dataset.  It  originally 
consisted  of  88k  noun-phrases  and  99k  contexts  with  NP-context  co-occurrence  information. 
We  further  down-sampled  to  2000  NPs  and  used  TF-IDF  and  cosine  similarity  to  construct  a 
noun-phrase  by  noun-phrase  co-occurrence  matrix. 

The SNP dataset consists of base pair information at 2810 loci for 957 individuals. The dataset
is annotated into three levels, where each individual is assigned a population, country of origin,
and continent. Each individual has two haplotype sequences, and we arbitrarily chose the maternal
haplotype. We measure similarity using edit distance. In this case, computing all pairwise
similarities is computationally intensive; it took over 1 hour for this computation.

The phylogeny dataset is a synthetic phylogeny generated by the phyclust R package. It consists
of 2048 genetic sequences, each consisting of 2000 base pairs. phyclust also generates a
reference phylogeny that serves as ground truth. As with the SNP data, we measured similarity
using edit distance. Computing all pairs of similarities took over 4 hours.

Figure 3.3: Experiments: 3.3(a): Comparison of algorithms on various datasets. 3.3(b): Outlier
fractions on datasets with ground truth clustering. 3.3(c): Subset of the NIPS hierarchy.

In  the  phylogeny  and  SNP  datasets,  we  have  access  to  a  reference  tree  that  can  be  used  in  our 
evaluation.  In  these  cases  we  can  report  the  outlier  fraction,  as  we  did  in  simulation.  However, 
the  other  datasets  lack  such  ground  truth  and,  without  it,  evaluating  the  performance  of  each 
algorithm  is  non-trivial.  Indeed,  there  is  no  well-established  metric  for  this  sort  of  evaluation. 

For this reason, we employ two distinct metrics to evaluate the quality of hierarchical clusterings.
They are a hierarchical k-means objective (HKM) [115] and an analogous hierarchical ratio-cut
(HRC) objective, both of which are natural generalizations of the k-means and ratio cut objectives
respectively, averaging across clusters, and removing small clusters as they bias the objectives.
Formally, let $\mathcal{C}$ be the hierarchical clustering and let $\tilde{\mathcal{C}}$ be all of the clusters in $\mathcal{C}$ that are larger
than $\log n$. For each $C \in \tilde{\mathcal{C}}$ let $x_C$ be the cluster center. Then:

$$\mathrm{HKM}(\mathcal{C}) = \frac{1}{|\tilde{\mathcal{C}}|} \sum_{C \in \tilde{\mathcal{C}}} \frac{1}{|C|} \sum_{x_i \in C} \frac{x_i^T x_C}{\|x_i\| \|x_C\|}$$

$$\mathrm{HRC}(\mathcal{C}) = \frac{1}{|\tilde{\mathcal{C}}|} \sum_{C \in \tilde{\mathcal{C}}} \sum_{C_k \subset C} \frac{K(C_k, C \setminus C_k)}{2|C_k|}$$

In Table 3.3(a) and 3.3(b), we record experimental results across the datasets for our algorithms.
On the read-the-web dataset, we were unable to run the non-interactive algorithms. On the SNP
and phylogeny datasets, we include computing similarities via edit distance in the running time
of each algorithm, noting that computing all pairs takes 6500 and 15000 seconds respectively.
The immediate observation is that our interactive algorithms are extremely fast; on the SNP and
phylogeny datasets, where computing similarities is the bottleneck, our approach leads to significant
performance improvements. Moreover, the algorithms perform well by our metrics; they find
clusterings that score well according to HKM and HRC, or that have reasonable agreement with
the reference clustering.¹

We are also interested in more qualitatively understanding the performance of these algorithms.
For the NIPS data, we manually collected a small subset of articles and visualized the hierarchy
produced by ActiveKMeans restricted to these objects. The hierarchy in Figure 3.3(c) is what
one would expect on the subset, attesting to the performance of ActiveKMeans. On the other
hand, this same evaluation on the RTW data demonstrates that interactive algorithms do not
perform well on this dataset, while the non-interactive algorithms do. We suspect this is because
the RTW dataset consists of many small clusters that do not get sampled by our approach.

For  the  SNP  and  phylogeny  datasets,  the  permuted  heatmaps  are  clear  enough  to  be  used  in 
qualitative  evaluations.  These  heatmaps  are  shown  in  Figure  3.4,  and  they  suggest  that  all  three 
interactive  algorithms  perform  very  well  on  these  datasets.  Heatmaps  for  the  remaining  datasets 
are  less  clear,  but  are  included  for  completeness. 


3.4  Proofs 


In  this  section  we  provide  proofs  for  all  theorems  in  this  chapter. 


3.4.1  Proof  of  Theorem  3.1 

Before  beginning  the  proof  of  the  three  claims  in  Theorem  3.1,  we  first  state  and  prove  two 
simple  lemmas  bounding  the  number  of  splits  and  levels  in  a  balanced  hierarchy. 

Lemma 3.6. A k-way hierarchical clustering on $n$ objects has at most $\frac{n}{k-1}$ splits.


Proof. A hierarchical clustering can be represented as a rooted tree $T$, where each leaf is a
singleton cluster and each internal node corresponds to a cluster containing all objects below this
node. Every k-way hierarchy can be represented by a k-ary tree and the number of internal nodes
in the k-ary tree exactly corresponds to the number of splits in the k-way hierarchy. Let $f(x)$ be
the number of internal nodes in a k-ary tree with $x$ leaves. It is easy to see that the recurrence
$f(x) = f(x - k + 1) + 1$ holds for all $x > k$ and $f(x) = 1$ for all $0 < x \le k$. Solving this
recurrence, we see that $f(n) \le \frac{n}{k-1}$, proving the Lemma.

¹ The SNP dataset is a k-way hierarchy and our algorithms (apart from HeurSpec) recover binary hierarchies
that cannot have high agreement with the reference.

Figure 3.4: Heatmaps of permuted matrices for SNP, Phylo, NIPS, and RTW (from left to right).
Algorithms are HeurSpec, ActiveSpectral, and ActiveKMeans from top to bottom.
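
The recurrence in this proof can be unrolled numerically. The sketch below assumes the base case
reads $f(x) = 1$ for $0 < x \le k$. The closed form $\lceil (x-1)/(k-1) \rceil$ is our own solution
of the recurrence under that assumption, not a formula quoted from the text. For a full k-ary
tree with $n = 1 + j(k-1)$ leaves it gives exactly $j = (n-1)/(k-1)$ splits.

```python
def num_splits(x, k):
    """Unroll f(x) = f(x - k + 1) + 1 with base case f(x) = 1 for 0 < x <= k."""
    if x <= 0 or k < 2:
        raise ValueError("need x >= 1 and k >= 2")
    count = 1
    while x > k:
        x -= k - 1      # merging one k-way split back into a leaf removes k - 1 leaves
        count += 1
    return count

def closed_form(x, k):
    """max(1, ceil((x - 1) / (k - 1))), computed with integer arithmetic."""
    return max(1, -(-(x - 1) // (k - 1)))

if __name__ == "__main__":
    for k in (2, 3, 4):
        for j in (1, 2, 5, 10):
            n = 1 + j * (k - 1)          # leaf count of a full k-ary tree with j splits
            print(k, n, num_splits(n, k), closed_form(n, k))
```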

Lemma 3.7. Let $\eta$ be the balance factor of the hierarchy and let $l$ be the total number of levels
in the hierarchy. Then:

$$l \le \frac{\log n}{\log\frac{1+\eta}{\eta}} \le C_\eta \log n$$

Proof. Note that for any split, the larger of the two clusters has at most an $\frac{\eta}{1+\eta}$ fraction of
the nodes. After $l$ levels, we want the largest cluster to have size at most 1, or

$$\left(\frac{\eta}{1+\eta}\right)^{l} n \le 1.$$

Solving for $l$ in this equation yields the result.


We  now  turn  to  proving  the  theorem.  In  the  proof,  we  will  define  several  failure  events  and  first 
show  that  the  algorithm  succeeds  if  none  of  the  failure  events  occur.  We  will  then  proceed  to 
bound  the  probability  of  each  of  the  failure  events. 
We establish some notation before proceeding. In the true hierarchy, we will denote each partition
problem (or split) by $\mathcal{S}_1, \ldots, \mathcal{S}_{n/(k-1)}$ (recall that by Lemma 3.6, there are at most $\frac{n}{k-1}$ of these).
Moreover, each split except for the split at the root of the hierarchy has a parent split, which is
the clustering problem directly above it in the hierarchy. For a split $\mathcal{S}_i$, denote its parent split by
$\mathcal{S}_{\pi(i)}$ so that $\pi(i)$ is the index of $\mathcal{S}_i$'s parent in the hierarchy.

For each split $\mathcal{S}_i$, we have three types of error events: a subsampling error event, an error event on
the correctness of the algorithm $\mathcal{A}$, and an error event on the averaging phase. In the subsampling
phase, we will report an error if the subsampled balance factor for the clustering problem at
split $i$, $\hat{\eta}$, is larger than $2\eta + 1$ (we will precisely define $\hat{\eta}$ subsequently). If $\hat{\eta} \le 2\eta + 1$, then
the assumption that $\eta = O(1)$ implies that $\hat{\eta} = O(1)$ so that $\mathcal{A}$ can successfully cluster the
subsample. Formally, these error events are defined as follows:

$$S_i = \{\text{at split } \mathcal{S}_i,\ \hat{\eta} > 2\eta + 1\}$$

$$A_i = \{\text{Algorithm } \mathcal{A} \text{ fails at split } \mathcal{S}_i\}$$

$$V_i = \{\text{Averaging fails at split } \mathcal{S}_i\}$$


It is easy to see that:

$$\mathbb{P}[\text{failure}] \le \mathbb{P}\Big[\bigcup_{i=1}^{n/(k-1)} (S_i \cup A_i \cup V_i)\Big]. \quad (3.2)$$

In  words,  the  algorithm  only  fails  if  one  of  these  error  events  occurs.  At  this  point,  one  could  use 
a  union  bound  to  decompose  this  further  into  a  sum  of  failure  probabilities,  but  it  is  challenging 
to  bound  each  failure  probability  independently  of  the  other  events.  Instead,  we  will  appeal  to 
the  following  lemma  to  upper  bound  the  right  hand  side  via  a  more  suitable  decomposition. 
Lemma 3.8. Let $B_0, B_1, \ldots, B_t$ be events in some measurable space. Then:

$$\mathbb{P}\Big[\bigcup_{i=0}^{t} B_i\Big] \le \mathbb{P}[B_0] + \sum_{i=1}^{t} \mathbb{P}[B_i \mid \neg B_0, \ldots, \neg B_{i-1}]$$


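Proof. First, the following identity is straightforward:

$$\bigcup_{i=0}^{t} B_i = \bigcup_{i=0}^{t} \Big( B_i \cap \bigcap_{j=0}^{i-1} \neg B_j \Big)$$

Now, using a union bound and the chain rule:

$$\mathbb{P}\Big[\bigcup_{i=0}^{t} B_i\Big] \le \sum_{i=0}^{t} \mathbb{P}\Big[B_i \cap \bigcap_{j=0}^{i-1} \neg B_j\Big] = \sum_{i=0}^{t} \mathbb{P}\Big[B_i \,\Big|\, \bigcap_{j=0}^{i-1} \neg B_j\Big]\, \mathbb{P}\Big[\bigcap_{j=0}^{i-1} \neg B_j\Big] \le \sum_{i=0}^{t} \mathbb{P}[B_i \mid \neg B_0, \ldots, \neg B_{i-1}],$$

where in the last step we used that probabilities must be upper bounded by 1.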
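
The inequality of Lemma 3.8 can be checked numerically on small finite probability spaces. The
sketch below is only a sanity check on random instances, not a proof. Terms whose conditioning
event has probability zero are skipped, which is harmless because the corresponding joint
probability is then zero as well. The function name and parameters are illustrative.

```python
import random

def check_lemma(num_events=4, num_outcomes=12, trials=200, seed=0):
    """Test P[union B_i] <= P[B_0] + sum_i P[B_i | no earlier event] on random events."""
    rng = random.Random(seed)
    outcomes = set(range(num_outcomes))
    for _ in range(trials):
        weights = [rng.random() for _ in range(num_outcomes)]
        total = sum(weights)
        prob = [w / total for w in weights]
        events = [{o for o in outcomes if rng.random() < 0.3} for _ in range(num_events)]

        def P(event):
            return sum(prob[o] for o in event)

        lhs = P(set().union(*events))
        rhs = P(events[0])
        none_so_far = outcomes - events[0]      # outcomes where no earlier event occurred
        for B in events[1:]:
            if P(none_so_far) > 0:
                rhs += P(B & none_so_far) / P(none_so_far)
            none_so_far = none_so_far - B
        if lhs > rhs + 1e-12:
            return False
    return True

if __name__ == "__main__":
    print(check_lemma())
```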


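Using Lemma 3.8, we can decompose the right hand side of Equation 3.2 as:

$$\mathbb{P}[S_1] + \mathbb{P}[A_1 \mid \neg S_1] + \mathbb{P}[V_1 \mid \neg S_1, \neg A_1] + \sum_{i=2}^{n/(k-1)} \Big( \mathbb{P}[S_i \mid \cdot\,] + \mathbb{P}[A_i \mid \cdot\,] + \mathbb{P}[V_i \mid \cdot\,] \Big),$$

where each probability in the sum is conditioned on the complements of all events that precede it
in the ordering $S_1, A_1, V_1, S_2, A_2, V_2, \ldots$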


Next we exploit independence of events to simplify each of the expressions. In particular, we
have the following independence assertions: each subsampling phase is independent of all previous
error events, conditioned on the successful recovery of the corresponding parent clustering;
each execution of the algorithm succeeds (or fails) independent of every previous failure event,
conditioned on the success of subsampling at that split; and each averaging phase succeeds (or
fails) independent of every previous failure event, conditioned on the success of sampling and
the black-box algorithm at that split. With these assertions we can reduce the above expression to:

$$\mathbb{P}\Big[\bigcup_{i=1}^{n/(k-1)} (S_i \cup A_i \cup V_i)\Big] \le \mathbb{P}[S_1] + \mathbb{P}[A_1 \mid \neg S_1] + \mathbb{P}[V_1 \mid \neg S_1, \neg A_1] + \sum_{i=2}^{n/(k-1)} \Big( \mathbb{P}[S_i \mid \neg S_{\pi(i)}, \neg A_{\pi(i)}, \neg V_{\pi(i)}] + \mathbb{P}[A_i \mid \neg S_i] + \mathbb{P}[V_i \mid \neg S_i, \neg A_i] \Big)$$


In the subsequent sections, we will bound each of these conditional probabilities. By showing
that the sum of these conditionals is small, we will arrive at an upper bound on the failure
probability of our algorithm.


The  Subsampling  Phase 

Here we bound the probability of the event $S_i$, conditioned on the successful recovery of $\mathcal{S}_i$'s
parent cluster. We need to demonstrate that the balance factor $\hat{\eta}$, restricted to the subsample, is
upper bounded by $2\eta + 1$ after subsampling $s$ objects, and moreover we have to do this across all
splits of the hierarchy. Consider one split at first; we have $n$ objects and $k$ clusters $C_1, \ldots, C_k$,
and define the random variables $X_1, \ldots, X_s \in [k]$ which indicate cluster membership of the $i$th
draw. Define the estimators $\hat{C}_j = \sum_{i=1}^{s} \mathbf{1}[X_i = j]$, so that $\mathbb{E}[\hat{C}_j / s] = |C_j|/n$. In both cases,
sampling with and without replacement, we can apply Hoeffding's inequality and a union bound
over the clusters $C_j$. Technically, in the case of sampling without replacement we must apply
a bound due to Serfling [ 52], but it is no worse than Hoeffding's inequality for independent
random variables. We obtain:

$$\mathbb{P}\Big[\max_{j \in [k]} \Big|\frac{\hat{C}_j}{s} - \frac{|C_j|}{n}\Big| > \epsilon\Big] \le 2k \exp\{-2s\epsilon^2\}$$
Using Lemma 3.6, a union bound over the events $S_i$, and inverting the concentration inequality,
we have that with probability $1 - \delta_1$, for all splits $\mathcal{S}_i$ and clusters $C_j$:

$$\left|\frac{1}{s}\hat{C}_j - \frac{|C_j|}{|\mathcal{S}_i|}\right| \le \sqrt{\frac{\log(2nk/(k-1)) + \log(1/\delta_1)}{2s}} \le \sqrt{\frac{\log 4n + \log(1/\delta_1)}{2s}},$$

where $|\mathcal{S}_i|$ is the total number of objects to be clustered at split $\mathcal{S}_i$. Using the fact that the
hierarchy has balance factor $\eta$, which holds here since we are conditioning on successful recovery
of the parent cluster at each step, we obtain:


$$\frac{1}{s} \max_j \hat{C}_j \le \frac{\eta}{1+\eta} + \sqrt{\frac{\log 4n + \log(1/\delta_1)}{2s}}$$

$$\frac{1}{s} \min_j \hat{C}_j \ge \frac{1}{1+\eta} - \sqrt{\frac{\log 4n + \log(1/\delta_1)}{2s}}$$

and the modified balance factor $\hat{\eta}$ is the ratio of these two quantities. Setting $\delta_1 = 4n \exp\{-\frac{s}{2(1+\eta)^2}\}$
gives that $\hat{\eta} \le 2\eta + 1 = O(1)$ so that the event $S_i$ does not hold, across all splits $\mathcal{S}_i$. This is the
first term in Equation 3.1, and if $s$ meets the lower bound in the theorem, we have that $\delta_1 = o(1)$
as needed.
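
As a check on this choice of $\delta_1$, the calculation can be written out explicitly. This is a
worked step under the assumption that the deviation term is
$\epsilon = \sqrt{(\log 4n + \log(1/\delta_1))/(2s)}$ and that the largest and smallest subsampled
cluster fractions are bounded by $\frac{\eta}{1+\eta} + \epsilon$ and $\frac{1}{1+\eta} - \epsilon$ respectively:

```latex
\begin{align*}
\delta_1 = 4n \exp\Big\{-\frac{s}{2(1+\eta)^2}\Big\}
  &\;\Longrightarrow\; \log 4n + \log(1/\delta_1) = \frac{s}{2(1+\eta)^2}
   \;\Longrightarrow\; \epsilon = \frac{1}{2(1+\eta)}, \\
\hat{\eta}
  &\le \frac{\frac{\eta}{1+\eta} + \epsilon}{\frac{1}{1+\eta} - \epsilon}
   = \frac{\frac{2\eta + 1}{2(1+\eta)}}{\frac{1}{2(1+\eta)}}
   = 2\eta + 1 .
\end{align*}
```

The same calculation shows that every subsampled cluster contains at least $\frac{s}{2(1+\eta)}$ objects.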


The  Clustering  Phase 

In the clustering phase, we simply need to add up the probabilities of failure for all executions
of the algorithm $\mathcal{A}$, conditioned on the fact that the subsampling phase for this split
yielded a constant balance factor. By assumption $\mathcal{A}$ fails on an input of size $s$ with probability
$O(s^{c_1} \exp(-s))$. With a union bound across all splits, the probability of any execution of
$\mathcal{A}$ failing is $O(\frac{n}{k-1} s^{c_1} \exp(-s)) = O(n s^{c_1} \exp(-s))$ (where we used Lemma 3.6). This is the
second term in the bound in Equation 3.1 and as long as $s = \omega(\log n)$, this failure probability is
$o(1)$.


The  Averaging  Phase 

Here our goal is to show that as long as subsampling and the subroutine clustering algorithm
succeeded, then the averaging phase will also succeed with high probability. The guarantees for
the averaging phase follow from assumption K2. In order to ensure that we place every object in
its correct cluster, we require that:

$$\frac{1}{\hat{c}_j} \sum_{x_k \in \hat{C}_j} K(x_i, x_k) > \frac{1}{\hat{c}_{j'}} \sum_{x_k \in \hat{C}_{j'}} K(x_i, x_k),$$

for all $x_i \in C_j$, for all $j' \ne j$ and across all splits. Here, we say that $\hat{C}_j = \{x_k \in C_j\} \cap \{x_k \in S\}$
and $\hat{c}_j = |\hat{C}_j|$ for all $j$. By assumption K2 and a union bound, we have that:

$$\frac{1}{\hat{c}_j} \sum_{x_k \in \hat{C}_j} K(x_i, x_k) \ge \min_{x_k \in C_j} \mathbb{E}[K(x_i, x_k)] - \sigma \sqrt{\frac{\log(C_\eta n) + \log\log n + \log(4/\delta_3)}{2\hat{c}_j}}$$

$$\frac{1}{\hat{c}_{j'}} \sum_{x_k \in \hat{C}_{j'}} K(x_i, x_k) \le \max_{x_k \in C_{j'}} \mathbb{E}[K(x_i, x_k)] + \sigma \sqrt{\frac{\log(C_\eta k n) + \log\log n + \log(4/\delta_3)}{2\hat{c}_{j'}}}$$


For the within cluster similarities we union bounded over each of the $C_\eta \log n$ levels, because
each object belongs to only one cluster per level. For between cluster similarities, we union
bounded over the $C_\eta \log n$ levels and the $k - 1 < k$ clusters that we will compare to for each
object $x_i$. Both equations hold with probability $1 - \delta_3$, because we used $\delta_3/2$ as the individual
probability of failure. Note also that we replace $M_j$ in assumption K2 with the sets $\hat{C}_j$ and $\hat{C}_{j'}$.
Because those sets are chosen uniformly at random, we can make this replacement.


Replacing $\hat{c}_j$ and $\hat{c}_{j'}$ both with the lower bound on the subsampled cluster sizes, arising from
the bound on $\hat{\eta}$, and observing that if the lower bound for the first expression is larger than the
upper bound for the second expression, we will make no mistakes at all splits of the hierarchy,
we obtain the following lower bound on $\gamma$, defined in assumption K1:

$$\gamma > 2\sigma \sqrt{\frac{1+\eta}{s} \left(\log(C_\eta k n) + \log\log n + \log(4/\delta_3)\right)} \quad (3.3)$$

Solving this equation for $\delta_3$ gives:

$$\delta_3 \le 4 C_\eta n k \log n \exp\left\{-\frac{s\gamma^2}{4\sigma^2(1+\eta)}\right\}. \quad (3.4)$$

This is the final term in Equation 3.1.


Proof  of  R1  and  R2 

The measurement complexity and running time are straightforward calculations. At each level
of the hierarchy we obtain at most $ns$ similarities and there are at most $O(\log n)$ levels of the
hierarchy, so that the total measurement complexity is $O(ns \log n)$. As for the running time, we
only ever call $\mathcal{A}$ on problems of size $s$, and there are at most $n$ such clustering problems which
gives the first term in the running time. The second term is the total running time for all averaging
phases across the hierarchy, which at each level takes $ns$ time.


3.4.2  Proof  of  Theorem  3.2 

Theorem 3.2 is almost a direct application of Theorem 3.1. We must first verify that the noisy
CBM family satisfies assumptions K1 and K2, and that Algorithm 5 satisfies something close to
assumption A1. In the noisy CBM, since, in expectation, the within cluster similarities are at least
$\gamma$ larger than the between cluster similarities, assumption K1 is satisfied. Assumption K2 is also
satisfied with $\sigma^2$ exactly corresponding to the noise variance of the subgaussian perturbation, and
this follows from the fact that subgaussian random variables enjoy exponential concentration.


To check that an assumption of A1-type is satisfied, we will have to reproduce some of the proof
of Balakrishnan et al. [16]. All of the facts stated here without proof are from [16]. First, Lemma
7 in [16] characterizes the spectral properties of the Laplacian of the constant block matrix $A$,
without perturbation. If the eigenvalues of $L_A$ are $\lambda_1 \le \lambda_2 \le \ldots \le \lambda_n$ and the eigenvectors are
$v^{(1)}, \ldots, v^{(n)}$, then they show:

1. $v^{(1)} = \frac{1}{\sqrt{n}} \mathbf{1}$ with $\lambda_1 = 0$.

2.  <  \vf  \  <  for  all  i  e  [n\,  with  A2  =  n/30.  A)  is  the  similarity  between  objects 

that  are  separated  at  the  first  split  of  the  hierarchy.  Moreover,  the  sign  pattern  of  v(2)  reveals 
the  coarsest  partition  of  the  clustering  that  generates  A. 

3-  ^(vPo  +  min {/3k,  (3R})  <  A3  <  ^(A)  +  n  max{/3L,  f3R}). 

Analysis of the perturbation matrix reveals that its Laplacian has spectral norm bounded by:


4 

2  log n  +  2  log 

0 

with probability at least $1 - \delta$. Equipped with the spectral properties and the bound on the
perturbation, we can apply the Davis-Kahan theorem [ 70]. Let $L_W$ be the Laplacian matrix of $W$
and denote the eigenvalues $\mu_1 \le \mu_2 \le \cdots \le \mu_n$ and the associated eigenvectors $u^{(1)}, \ldots, u^{(n)}$.

The Davis-Kahan theorem states that:


$$\|u^{(2)} - v^{(2)}\|_2 \le \frac{\sqrt{2}\, \|L_R\|_2}{\xi},$$

where $\xi = \min_{i \ne 2} |\lambda_2 - \lambda_i|$. This bound shows that the eigenvectors of $L_W$ are close to the
eigenvectors of $L_A$, which we know reveal the cluster structure.

However, a more refined bound is possible. Let $k = u^{(2)} - v^{(2)}$. We will proceed to bound $\|k\|_\infty$,
which will give us more precise control on the eigenvector deviation. Some algebra shows that:


\m  < 


Ti 


,(2)| 


7)(A2 


_  T2  t3  T4 

7^2)1  +  \Aik\  +  \LRiV^]\  +  \Rik\ 
I  M2  —  Dai  —  DRi  | 

" - V - ' 

t5 


The term $T_1$ is bounded through application of Weyl's inequality and the properties of $L_A$:


T !  =  |rA2)(*)(A2 


M2) |  <  |v(2)(i)||A2  -  n2\  < 


Lr ||2  <  2*Vv 


2  log  n  +  2  log 


4 

6 
The term $T_2$ is bounded by:


6 


7 


1  +  T) 


These two bounds hold jointly with probability at least $1 - \delta$.


For $T_3$, we decompose into $D_{R_i} v^{(2)}(i)$ and $R_i v^{(2)}$. The former is a subgaussian with scale factor
at most $\sigma\sqrt{\eta}$ since $|v^{(2)}(i)| \le \sqrt{\eta}/\sqrt{n}$ and $D_{R_i}$ is a subgaussian with scale factor $\sigma\sqrt{n}$. For the
latter, since $v^{(2)}$ is a unit vector, $R_i v^{(2)}$ is a subgaussian with scale factor $\sigma$. Taking a union bound
across all $i$ and using a standard sub-gaussian tail bound, we have that:

$$T_3 \le 4\sigma \sqrt{\eta \log(2n/\delta)}$$

with probability at least $1 - \delta$.


For $T_4$, we have $T_4 \le |R_i k| \le \|R_i\|_2 \|k\|_2$, and $\|R_i\|_2 \le \|R\|_2$, so for $n$ large enough under the
$1 - \delta$ event used to bound $T_1$, we have:

$$\|R_i\|_2 \le C\sigma\sqrt{n},$$

for some absolute constant $C$. This gives:


T  ^M2<2^2V2l0gn  +  2l0g5 

T4  <  Cayjn - < 


6 


ry-3- 

1  1  +i) 


Lastly, we must lower bound $T_5$. We write:

$$T_5 = |\mu_2 - D_{A_i} - D_{R_i}| = |D_{A_i} + D_{R_i} - \mu_2|.$$


Note  that: 

DAi  >  0  +  f3R})  >  n/30  + 

1  +  rj  l  +  ry 

while $\mu_2 \le n\beta_0 + \|L_R\|_2 \le n\beta_0 + \|D_R\|_2 + \|R\|_2$ by Weyl's inequality and the triangle inequality.
This gives:


T5> 


n'y 

I  +  77 


2110*112 


\\Rh 


provided that the expression inside the absolute value is positive. We will now show this is indeed
the case. Under the same $1 - \delta$ probability event used to bound $T_1$, we have that:


nDRh 


II-RII2  <  3(7 y/n 


2  log  77  +  2  log 


4 

5’ 




so  provided  that  a  < 


_  iVn 

3(1+77) -y/ 2  log  n+2  log  4/5’ 


we  will  have: 


2||I>fl||2  —  ll-Rlh  < 


n'y 

W+V) 


n'y 


So  that  \T5\  >  2{1+v). 

Combining all of the bounds, we have that with probability at least $1 - 2\delta$:


< 
The algorithm succeeds if $\|k\|_\infty <$ […]. Rearranging, if
1  +  i]  |  r)  v/logn  +  log(4/5)’  1  +  7  ^logn  +  log(4/5)’  V  loSn  +  log(4/<5)  J  ’ 

then the algorithm succeeds with probability at least $1 - \delta$, where $c$ is some universal constant.
Suppressing dependence on $\eta$ and rearranging to solve for $\delta$, we have:

A  +  (  •  /  72  74  74 

5  <  nexp  I  —n  mm  <  — ,  — ,  — - 

which  meets  assumption  A1  provided  that  7 2 /a  =  f2(l).  This  is  true  since  72  <  7,  which  means 
that  7P/ ap  is  also  0(1)  for  any  p  >  1.  We  actually  use  this  bound  directly  to  obtain  the  second 

term in Equation 3.1, and this allows us to obtain a guarantee for all values of $\gamma$ and $\sigma$.


3.4.3  Proof  of  Theorem  3.3 

The lower bound will be based on an adversarial construction. Let $n$ and $m$ be powers of two
with $n > m$. We will consider a perfectly balanced binary hierarchy and say that the clusters at
level $l$ all have cluster size $n/2^l$. This means that $l = 0$ corresponds to the cluster containing all
of the objects, and at the bottom of the hierarchy the largest value of $l$ is $\log_2(n/m)$. The
similarity between two objects grouped at level $l$, but not grouped at level $l + 1$, will be $\gamma_l$, which
is fixed and known to the algorithm.

Each time the learner makes a measurement, the adversary will respond with a value that is
consistent with all existing measurements, but that keeps the number of consistent hierarchical
clusterings as large as possible. The goal for the learner is to whittle down the size of the
consistent set until there is just a single consistent hypothesis, and as soon as this happens the
learner has successfully recovered the clustering. The goal for the adversary is to provide as
little information as possible to the learner at each round, so that the learner must make many
measurements before she is confident about the clustering.

For each query, the adversary responds with one of $\{\gamma_i\}_{i=1}^{l}$, so there are $l$ possible choices. If the current size of the consistent set is $S$, then by the pigeonhole principle, there must exist a choice of response by the adversary for which the size of the subsequent consistent set is at least $S/l$. Therefore, after $T$ rounds, the adversary can ensure that the size of the consistent set has been reduced multiplicatively by no more than $l^T$. If the size of the consistent set is initially $S_0$ (we will compute this size shortly), a necessary condition for the learner's success is:

$$\frac{S_0}{l^T} \le 1.$$

We now compute the number of hierarchical clusterings that extend $l$ levels. We start by considering all $n!$ permutations on $n$ objects, and aim to count the number of permutations that induce the same hierarchical clustering. Let $T(n)$ be the number of permutations that induce the same hierarchical clustering, where the smallest clusters have size $m$. This means that $T(m) = m!$. We compute $T(n)$ recursively: at the top level, we can permute the two clusters and then use any of the permutations on the two sub-clusters, leading to the recurrence $T(n) = 2T(n/2)^2$, with the base case $T(m) = m!$. This recurrence solves to $T(n) = 2^{n/m-1}(m!)^{n/m}$, so that the number of hierarchical clusterings with smallest cluster size $m$ is:

$$S_0 = \frac{n!}{2^{n/m-1}(m!)^{n/m}}.$$

The necessary condition becomes:

$$\frac{n!}{2^{n/m-1}(m!)^{n/m}\,\log_2(n/m)^T} \le 1.$$

Using the bound $(n/e)^n\sqrt{2\pi n} \le n! \le n^n$, which follows from Stirling's approximation, the necessary condition is:


T  > 


nl°g2(w) 

l°g2  [log2(n/m)  +  1] 
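
The counting step in this proof can be checked numerically on small instances. The sketch below is illustrative only: the function names are ours, and the number of adversary responses is passed in as a parameter rather than fixed, since it is an assumption of the example. It evaluates the recurrence $T(n) = 2T(n/2)^2$, compares it with the closed form, and finds the smallest number of rounds allowed by the pigeonhole condition.

```python
from math import factorial


def perms_per_hierarchy(n, m):
    """T(n) from the recurrence T(n) = 2 T(n/2)^2 with base case T(m) = m!."""
    if n == m:
        return factorial(m)
    return 2 * perms_per_hierarchy(n // 2, m) ** 2


def closed_form(n, m):
    """Closed form 2^(n/m - 1) (m!)^(n/m) stated in the proof."""
    return 2 ** (n // m - 1) * factorial(m) ** (n // m)


def num_hierarchies(n, m):
    """S_0 = n! / T(n): number of balanced hierarchies with leaf clusters of size m."""
    return factorial(n) // perms_per_hierarchy(n, m)


def min_rounds(n, m, num_responses):
    """Smallest T with num_responses^T >= S_0 (requires num_responses >= 2)."""
    s0 = num_hierarchies(n, m)
    rounds = 0
    while num_responses ** rounds < s0:
        rounds += 1
    return rounds
```

For example, `num_hierarchies(8, 2)` counts the balanced hierarchies on eight objects with leaf clusters of size two, and `min_rounds(8, 2, 3)` gives the number of queries the adversary can force when three distinct responses are available.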


3.4.4  Proof  of  Proposition  3.4 

The algorithm we will use is the following:

1. Pick an object $x$.

2. Take the $n/2$ objects $x_j$ with the largest $K(x, x_j)$ values and place them in a cluster $C_1$. Place the remaining objects in a cluster $C_2$.

3. Recursively partition $C_1$ and $C_2$.

Under the assumption that $C^\star$ is a balanced binary hierarchy, and under assumptions K1 and K2, this algorithm correctly recovers the clustering. This is true because between-cluster similarities are strictly smaller than within-cluster similarities, so every partitioning step is exact.

To recover a cluster of size $s$, we use exactly $s$ similarities. Therefore, to recover all clusters down to size $m$, we use $n \log(n/m)$ similarities, proving the proposition.
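
The three steps of this procedure can be sketched as follows. This is a minimal illustration under our own assumptions, not the thesis implementation: the pivot is taken to be the first object in the list and is counted as a member of its own half, and `prefix_similarity` is a hypothetical noiseless similarity on integers whose leading bits encode the hierarchy.

```python
def split_recursively(objects, similarity, min_size):
    """Return the recovered hierarchy as nested frozensets."""
    if len(objects) <= min_size:
        return frozenset(objects)
    pivot, others = objects[0], objects[1:]
    # One similarity query between the pivot and every other object.
    ranked = sorted(others, key=lambda o: similarity(pivot, o), reverse=True)
    half = len(objects) // 2
    near = [pivot] + ranked[: half - 1]  # the pivot's own cluster
    far = ranked[half - 1:]              # the sibling cluster
    return frozenset({
        split_recursively(near, similarity, min_size),
        split_recursively(far, similarity, min_size),
    })


def prefix_similarity(a, b, bits=3):
    """Toy similarity: number of leading bits (out of `bits`) shared by a and b."""
    return bits - (a ^ b).bit_length()
```

Calling `split_recursively([5, 0, 3, 6, 1, 7, 2, 4], prefix_similarity, 2)` recovers the hierarchy in which objects sharing more leading bits are grouped deeper in the tree, regardless of the order in which the objects are listed.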


3.4.5  Proof  of  Theorem  3.5 

For the first claim, by Theorem 5.1, we must lower bound the quantity $W(\mathcal{H}, \alpha)$. We interpret $\mathcal{H}$ as a collection of vectors defined by $v_C = \mathrm{vec}(M[C])$ for each $C \in \mathcal{H}$. We must lower bound:

$$W(\mathcal{H}, \alpha) = \max_{C \in \mathcal{H}} \sum_{C' \ne C} \exp\left(-\|v_C - v_{C'}\|_2^2/\alpha\right).$$

When $n$ and $m$ are both powers of two, a subset of $\mathcal{H}$ is the set of all perfectly balanced binary hierarchical clusterings with minimum cluster size $m$. Let $C_0$ be one of these models. Consider perturbing $C_0$ by taking an object and swapping that object with another one in the adjacent cluster at the deepest level of the hierarchy. There are $nm/2$ such perturbations and any perturbation $C$ has $\|v_{C_0} - v_C\|_2^2 = \gamma^2(8m - 4)$. This gives the lower bound of:

$$W(\mathcal{H}, \alpha) \ge \frac{nm}{2} \exp\left(-\frac{\gamma^2}{\alpha}(8m - 4)\right).$$

By Theorem 5.1, if $W(\mathcal{H}, 1) \ge 3$, then the minimax risk is bounded from below by $1/2$. Applying our lower bound and solving for $\gamma$ proves the first part of the result.

For the second claim, if we certify that the uniform sampling strategy minimizes $W(\mathcal{H}, \alpha, B)$ under the budget constraint, then we can immediately apply Theorem 5.5. We will use Proposition 5.6 to achieve this.

Focusing only on the set of perfectly balanced binary hierarchical clusterings with minimum cluster size $m$, which we call $\mathcal{H}'$, it is easy to see that when $B$ is uniform, every one of these hypotheses achieves the maximum in the definition of $W(\mathcal{H}', \alpha, B)$. Moreover, notice that for every pair of object pairs $\{a, b\}$, $\{a', b'\}$, there is a bijection $p$ over $\mathcal{H}'$ based on swapping $a$ with $a'$ and $b$ with $b'$ in the hierarchy such that for any hypothesis $C$, we have $v_C(a, b) = v_{p(C)}(a', b')$. This claim is fairly easy to see. If in $C$, $a$ and $b$ are clustered at some level $l$, then by swapping $a$ with $a'$ and $b$ with $b'$ to form $p(C)$, $a'$ and $b'$ are clustered at level $l$ in $p(C)$, so both terms will be identical because we are in a constant block model.

Since $p$ is a bijection, when we take $\pi$ to be uniform over the hypotheses, we have:

$$\begin{aligned}
&\mathbb{E}_{C \sim \pi} \sum_{C' \ne C} \left(v_C(a, b) - v_{C'}(a, b)\right)^2 \exp\left(-\|v_{C'} - v_C\|_2^2\right) \\
&\quad = \mathbb{E}_{C \sim \pi} \sum_{C' \ne C} \left(v_{p(C)}(a', b') - v_{p(C')}(a', b')\right)^2 \exp\left(-\|v_{p(C')} - v_{p(C)}\|_2^2\right) \\
&\quad = \mathbb{E}_{C \sim \pi} \sum_{C' \ne C} \left(v_C(a', b') - v_{C'}(a', b')\right)^2 \exp\left(-\|v_{C'} - v_C\|_2^2\right).
\end{aligned}$$

This means we may apply Proposition 5.6, which certifies that uniform sampling minimizes the function $W(\mathcal{H}', \alpha, B)$ under the budget constraint.

Equipped with this fact, we can reproduce the calculation above but with $B_i = \tau/\binom{n}{2}$, giving:


3.5  Discussion 

In this chapter, we developed several interactive algorithms for hierarchical clustering. These algorithms have strong computational and statistical guarantees, and in the case of spectral clustering, provably outperform all non-interactive approaches.

This chapter supports the main claim of this thesis: that interactivity leads to improvements in computational and statistical performance, particularly when datasets exhibit non-uniformity. In this chapter, non-uniformity is measured in terms of the sizes of the clusters to be recovered. Our interactive clustering approach is particularly powerful when one must recover large clusters at the top levels of the hierarchy, and small clusters deeper in the hierarchy. In this case, the interactive algorithms developed here significantly improve on non-interactive approaches in both computational and statistical efficiency.


Chapter 4


Interactive  Latent  Tree  Metric  Learning 


Knowledge of a network's topology and internal characteristics such as delay times and losses is crucial to maintaining seamless operation of network services. Yet typical networks of interest are incredibly large and decentralized so that these global properties are not directly available, but rather must be inferred from a small number of indirect measurements. Network tomography [15, 167] is a promising approach that aims to gather such knowledge using only end-to-end measurements between nodes at the periphery of a network with limited cooperation from core routers. The design of algorithms that reliably and accurately recover network characteristics from these measurements is an important research direction.

Most current methods focus on single source network tomography; they use similarity of delay or similarity of loss measurements from a single source to multiple nodes, caused by shared path segments, to infer a tree topology between the source and end nodes. The assumption of a tree topology is justified under the premise of shortest path routing from the source to each end node. These procedures either rely on infrequently deployed multicast probes ([ , , 80, 8 ]) or use a series of back-to-back, carefully coordinated, unicast probes ([56, 82, 86, 135, 166]), which makes the method sensitive to packet re-orderings and asynchrony between end nodes. These issues limit the applicability of single source tomography methods.

Multiple source network tomography is an alternative approach that uses measurements between pairs of end nodes that form an additive metric on a graph. Several network measures such as end-to-end delay, loss, or hop counts between pairs of end nodes form an (approximate) additive metric, as a path measurement is the sum of the measure along the links constituting the path. It is possible to learn such metrics using light-weight measurements such as hop counts extracted from packet headers [84] or pings. If the given measurements form an additive metric on an acyclic or tree graph, a variety of methods can be used to reconstruct the underlying structure [ 08, 135, ]. However, typically, the underlying graph is not an exact tree, as peering links between different network providers introduce cycles and violate the tree assumption, again limiting the effectiveness of existing methods.

Given the size and complexity of the Internet, the practicality of any network tomography algorithm should be evaluated not only by its noise tolerance and robustness to violations of any modeling assumptions, but also by its measurement or probing complexity (the number of measurements/probes needed as a function of the number of end hosts in the network). State-of-the-art methods for both single- and multi-source network tomography typically suffer in at least one of these directions. Many methods do not optimize and/or provide rigorous guarantees on the number of measurements needed to recover the underlying graph structure, while others are not guaranteed to be robust to noise in these measurements. Moreover, to the best of our knowledge, no method, with the exception of [ , 42], considers violations of the assumption that the underlying topology is a tree. In this chapter, we address all of these deficiencies.

Unfortunately, additive metrics can be unidentifiable given just pairwise distance measurements, and therefore one must impose some structural restrictions. Motivated by recent work [ 42] showing that internet latency and bandwidth can be well approximated by path lengths on trees, our work, much like existing network tomography results, is grounded in a tree metric assumption. However, we introduce two models to capture violations of this assumption: (a) an additive noise model, where all measurements are corrupted by additive subgaussian noise, resulting in small deviations from the tree metric, and (b) a persistent noise model in which a fraction of the measurements are arbitrarily corrupted. The persistent noise model also captures the effects of missing measurements due to dropped packets or unresponsive nodes. Even under these noise models, our algorithms have strong guarantees about correctness and measurement complexity.

Specifically, we present two algorithms that use interactively selected light-weight probes to construct a weighted tree whose path lengths provide a faithful representation of the pairwise measurements between end hosts in the network. While the additional nodes in the resulting tree need not correspond to hidden network elements, such a representation enables distance approximations between unmeasured hosts, closest neighbor/server selection, and topology-aware clustering, all of which can improve performance of network services.

Our  contributions  can  be  summarized  as  follows: 

1. We present algorithms for the multi-source network tomography problem that improve on existing work in at least one of two regards: our algorithms have strong correctness guarantees in the presence of noisy measurements, which can capture violations of the tree-metric assumption, and, by intelligent use of light-weight probes, they come with bounds on probing/measurement complexity.

2. Our first algorithm addresses the additive noise model. It uses $O(pl \log^2 p)$ pairwise measurements in the presence of noise and $O(pl \log p)$ measurements in the absence of noise, where $p$ is the number of end hosts in the network and $l$ is the maximum degree of any node, to construct a tree that accurately reflects the measurements. As our guarantees hold even for highly unbalanced tree structures, this improves on existing work [86, 135] that requires balanced-ness restrictions.

3. Under the persistent noise model, our second algorithm uses $O(pl \log^2 p)$ pairwise measurements to construct a tree approximation, even when a fixed fraction of the measurements are arbitrarily corrupted. Robustness to persistent noise, however, comes at the cost of requiring some balanced-ness of the underlying tree.

This chapter also lends evidence to the three overarching themes of this thesis. While we will not fully characterize non-interactive approaches for these problems, we will see that our interactive procedures are statistically and computationally efficient in comparison with naive non-interactive procedures. We will also see that uniformity, measured by the degree of the underlying tree, affects the statistical efficiency, so that our interactive algorithms are particularly suited for non-uniform (low-degree) problems.

This chapter is organized as follows. Section 4.1 discusses related work and comparisons to our algorithms. We provide background definitions and formally specify the multi-source tomography problem in Section 4.2. Our first algorithm, which uses selective pairwise measurements to recover an unrooted, unbalanced tree topology, is presented in Section 4.3.1, along with an analysis of its measurement complexity and tolerance to additive noise corrupting the measurements. In Section 4.3.2, we present our main algorithm, Rising (Robust Identification using Selective Information of Network Graphs), and analyze its robustness to persistent noise as well as its measurement complexity. We validate the proposed algorithms using simulations as well as real Internet measurements from the King [98] and IPlane [129] datasets in Section 4.4 and provide proofs in Section 4.5. We conclude in Section 4.6.


4.1  Related  Work 


Initial work towards mapping the Internet was based on injecting TTL (Time-to-Live)-limited probe packets called traceroutes that record the exact path traversed by the packet [73, 159]. Since traceroute is based on augmenting Time-To-Live information in packets, traceroute-based tomography approaches are inconsistent when there are several paths between two network elements. Moreover, anonymous routers [ ] and router aliases [99] do not augment packet headers, and firewalls as well as network address translation (NAT) boxes simply block traceroute packets, posing significant challenges to traceroute-based tomography.

Among the various algorithms for single-source tomography, two recent methods are particularly relevant to our work: the DFS-ordering algorithm of Eriksson et al. [86] and the work of Ni et al. [ 35]. The first provably uses $O(pl \log p)$ probes to recover a balanced $l$-ary tree topology; however, the authors make no claims about the correctness of the algorithm in the presence of noisy measurements. Ni et al. present the Sequential Logical Topology (SLT) algorithm, which uses $O(pl \log p)$ ($O(pl \log^2 p)$ under additive noise) probes to recover balanced $l$-ary trees while also guaranteeing correct recovery of the topology when measurements are corrupted by additive noise. Our first algorithm improves on the work of Ni et al. by relaxing the balanced-ness assumption while maintaining the same measurement complexity.

In multi-source tomography, a number of algorithms [6 , 85, 90] find Euclidean or non-Euclidean embeddings that accurately reflect the measurements. While some of these algorithms have strong measurement complexity guarantees [8. ], they do not capture the inherent hierarchical structure of the network and thus may be less useful than algorithms that recover tree or more intuitive models. In addition to the embedding-based algorithms, the work of Rabbat and Nowak [ ] casts the multi-source tomography problem as a set of statistical hypothesis tests that differentiate topological structure between two senders and two receivers. While their approach is algorithmically more straightforward, they only identify the presence of a shared link between the senders and the receivers and cannot distinguish all possible topological configurations between four end hosts as we can.

If the measurements form an additive tree metric, then a host of algorithms could be used to build a tree representation [ 8, , 45], some coming with measurement complexity bounds. However, the tree metric assumption does not hold in practice, and as shown in [ ], network measurements such as latency and bandwidth only approximate additive tree metrics. It is consequently important to design algorithms that are robust to violations of the tree metric properties. Sequoia [ ] is one algorithm designed for this purpose. Unfortunately, it comes with no guarantees on correctness in the presence of these violations, and while it seems to use only a limited number of probes in practice, it lacks measurement complexity bounds. In this chapter, we build on this line of work by designing an algorithm with theoretical guarantees on correctness and measurement complexity. Another method that addresses more general graph structures, beyond trees, was proposed recently in [ ]. However, this method also does not optimize the measurement complexity.

Our work, and network tomography in general, have strong connections to the task of learning the structure of latent variable graphical models and to problems in phylogenetic inference. For example, in [ ] and [5 ], algorithms are proposed to learn tree-structured graphical models using pairwise empirical correlations obtained from measurements of variables associated with leaf nodes. Under this setup, the correlations form an exact, rather than approximate, tree metric. Moreover, due to the different measurement model, this work does not explicitly optimize the number of pairwise measurements used. Our first algorithm is indeed based on the work of Pearl and Tarsi [13 ] and hence we call it PearlReconstruct.

In phylogenetics, the task of learning an evolutionary tree using genetic sequence data from several extant species is closely related to the single-source tomography problem. Several algorithms, such as the neighbor-joining algorithm [94, 135, 151], have been applied to both problems. Also see [ ], [68], and [88] for more details. To the best of our knowledge, the algorithms we propose are novel and do not exist in the phylogenetics literature.


4.2  Background 

Let $\mathcal{X} = \{x_i\}_{i=1}^{p}$ denote the end hosts in a network and let $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ be a function representing the true distances between the nodes, so that $d(x_i, x_j)$ is the distance, as measured in the network, between the hosts $x_i$ and $x_j$.

Figure 4.1: Possible structures for four leaves in a tree. If $d(w, x) + d(y, z) < d(w, y) + d(x, z) = d(w, z) + d(x, y)$, then the structure and labeling are those of (a). If $d(w, x) + d(y, z) = d(w, y) + d(x, z) = d(w, z) + d(x, y)$, then the structure is a star (b).


Our work focuses on distance functions $d$ that form approximate additive tree metrics. Specifically, let $\mathcal{T} = (V, E, c)$ be a weighted tree with vertices $V$, edges $E$, and weights $c$, for which $\mathcal{X}$ is the set of leaves. To avoid identifiability issues, our focus will be on minimal trees, for which each internal node has degree $\ge 3$ and each edge has strictly positive weight. An additive tree metric on $\mathcal{X}$ is a function $d_{\mathcal{T}}$ such that $d_{\mathcal{T}}(x_i, x_j) = \sum_{(x_k, x_l) \in \mathrm{Path}(x_i, x_j)} c(x_k, x_l)$; that is, the distance between two points is the sum of the edge weights along the unique path between them. A useful property of additive tree metrics is the four-point condition:

Definition 4.1. A metric $(\mathcal{X}, d)$ satisfies the four-point condition (4PC) if for any set of points $w, x, y, z \in \mathcal{X}$ ordered such that $d(w, x) + d(y, z) \le d(w, y) + d(x, z) \le d(w, z) + d(x, y)$, we have $d(w, y) + d(x, z) = d(w, z) + d(x, y)$.

The 4PC is related to the quartet test, a common technique for resolving tree structures (indeed, there are a host of quartet-based algorithms for phylogenetic inference, for example [ ]). The quartet test is used to identify the structure between any four leaves in a tree using only the pairwise distances between those leaves. It is easy to see that any four leaves either form a structure like that in Figure 4.1(a) or a star (Figure 4.1(b)), and using the 4PC we can identify not only which structure but also the correct labeling of the leaves (see Figure 4.1 for more details).
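
A quartet test of this kind can be sketched in a few lines. The example below is an illustration under our own assumptions rather than the procedure used later in the chapter: `d` is any symmetric distance function on leaf labels, `make_metric` is a hypothetical helper that builds one from a table, and `tol` is a slack parameter that defaults to exact comparison.

```python
def make_metric(table):
    """Symmetric lookup d(a, b) built from a dict keyed by unordered pairs."""
    lookup = {frozenset(pair): value for pair, value in table.items()}
    return lambda a, b: lookup[frozenset((a, b))]


def quartet_test(d, w, x, y, z, tol=0.0):
    """Classify four leaves using only their pairwise distances.

    Returns ("star", None), or ("split", (pair_a, pair_b)) where the two
    pairs lie on opposite sides of the internal edge.
    """
    pairings = [((w, x), (y, z)), ((w, y), (x, z)), ((w, z), (x, y))]
    sums = [d(*p) + d(*q) for p, q in pairings]
    order = sorted(range(3), key=lambda i: sums[i])
    smallest, largest = sums[order[0]], sums[order[2]]
    if largest - smallest <= tol:
        return "star", None
    # The pairing with the strictly smallest sum groups the sibling leaves.
    return "split", pairings[order[0]]
```

The pairing with the smallest distance sum identifies which leaves are grouped together, which is the labeling information described in Figure 4.1.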

Any metric that satisfies the four-point condition is a tree metric for some tree. Unfortunately, latency and hop counts in real networks do not exactly fit into this framework, but only approximate tree metrics [ ]. One characterization of this approximation is the 4PC-$\epsilon$ condition, which requires $d(w, z) + d(x, y) \le d(w, y) + d(x, z) + 2\epsilon \min\{d(w, x), d(y, z)\}$ for some parameter $\epsilon$ instead of the equality in Definition 4.1. Metrics for which $\epsilon$ values are low can be well approximated by tree metrics, and empirical studies showing that real network measurements satisfy 4PC-$\epsilon$ for small values of $\epsilon$ motivate the use of this model.
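
To make the 4PC-$\epsilon$ condition concrete, the sketch below computes the smallest $\epsilon$ for which a given quartet satisfies it. The function name and the distance callback `d` are our own illustrative choices, and the four points are assumed to be distinct so that the denominator is positive.

```python
def four_point_epsilon(d, w, x, y, z):
    """Smallest epsilon for which the quartet satisfies the 4PC-epsilon condition."""
    pairings = [((w, x), (y, z)), ((w, y), (x, z)), ((w, z), (x, y))]
    # Order the three pairings by their distance sums, as in Definition 4.1.
    scored = sorted(pairings, key=lambda pq: d(*pq[0]) + d(*pq[1]))
    first_pair, second_pair = scored[0]
    middle = d(*scored[1][0]) + d(*scored[1][1])
    largest = d(*scored[2][0]) + d(*scored[2][1])
    return (largest - middle) / (2 * min(d(*first_pair), d(*second_pair)))
```

An exact tree metric gives a value of zero for every quartet, while four nodes on a unit-weight cycle give a value of one.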

In this work, we take a more statistical approach and instead assume that $d(x_i, x_j) = d_{\mathcal{T}}(x_i, x_j) + g(x_i, x_j)$, where the function $g$ models the network's deviations from a tree metric. This approach allows us to not only formally state the multi-source network tomography problem but also to make rigorous guarantees about the performance of our algorithms. We focus on two models for these deviations:

1. Additive Noise Model - In this model, $g(x_i, x_j)$ is drawn from a subgaussian distribution with $\sigma^2$ as a scale factor¹. The small perturbation model studied in single source network tomography (see for example [ >5]) is similar to this, as subgaussian noise is bounded, with high probability, by a small constant (depending on $\sigma^2$). This model captures the inherent randomness in certain types of measurements, such as latencies. Under this formulation we allow each measurement to be observed several ($n$) times.

2. Persistent Noise Model - Here $g(x_i, x_j) = 0$ with probability $q$, independent of all other $x_i$ and $x_j$, and with probability $1 - q$, $g(x_i, x_j)$ is arbitrarily (or adversarially) chosen. We believe this is a reasonable model of how the measurements do not exactly form a tree metric, due to violations caused by peering links, unresponsive nodes, or missing measurements. To more accurately model violations of tree metric assumptions, multiple requests for a measurement all reveal the same (possibly incorrect) value, so we only obtain one sample of each measurement. To the best of our knowledge, there are no other efforts to study this noise model.

Figure 4.2: CDFs of $\epsilon$ values in the 4PC-$\epsilon$ condition for two real world datasets (King [98] and IPlane [129]) along with a dataset of points drawn uniformly from the surface of a sphere, where geodesic distance defines the metric.
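
The practical difference between the two models is that repeated queries help under additive noise but not under persistent noise. The following sketch simulates both as measurement oracles; it is illustrative only. Gaussian noise stands in for a generic subgaussian distribution, averaging the repeated observations is our own choice, and the `corrupt` callback is a hypothetical stand-in for the adversary.

```python
import random


def additive_noise_measurement(tree_distance, a, b, sigma, n, rng):
    """Average n noisy observations of d_T(a, b).

    Gaussian noise is one example of a subgaussian distribution; averaging
    the n repeats is an illustrative choice, not taken from the text.
    """
    clean = tree_distance(a, b)
    return sum(clean + rng.gauss(0.0, sigma) for _ in range(n)) / n


class PersistentNoiseOracle:
    """Each pair is clean with probability q, otherwise corrupted once and for all."""

    def __init__(self, tree_distance, q, corrupt, rng):
        self._tree_distance = tree_distance
        self._q = q
        self._corrupt = corrupt  # hypothetical callback choosing the corrupted value
        self._rng = rng
        self._answers = {}

    def measure(self, a, b):
        key = frozenset((a, b))
        if key not in self._answers:
            clean = self._tree_distance(a, b)
            is_clean = self._rng.random() < self._q
            self._answers[key] = clean if is_clean else self._corrupt(a, b, clean)
        # Repeated requests reveal the same (possibly incorrect) value.
        return self._answers[key]
```

Caching the first answer for each pair is what makes the noise persistent: asking again never yields a fresh sample.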

While [ ] capitalized on the fact that ~80% of the quartets satisfy 4PC with a small perturbation $\epsilon$, we also note that ~20% of the quartets do not satisfy the 4PC even with $\epsilon = 1$, which corresponds to triangle inequality violations (see Figure 4.2, where we plot the CDF of $\epsilon$ values for two real-world datasets). We attempt to address both of these phenomena with our two noise models: additive noise to capture the small deviations from 4PC and persistent noise to capture the larger perturbations. In this chapter, we address these two types of noise separately, but note that our second algorithm can be modified to handle both types of noise simultaneously.

We  are  now  prepared  to  formally  specify  our  problem: 

Problem 4.1. Given a noisy metric space $(\mathcal{X}, d)$ equipped with a noisy metric $d = d_{\mathcal{T}} + g$ for some tree $\mathcal{T}$, recover $\mathcal{T}$ and $d_{\mathcal{T}}$ while minimizing the number of measurements of $d$.

In this chapter, we develop algorithms for this problem under the assumption that $g$ corresponds to one of the models above. We first define several quantities that appear in the sequel. For any tree $\mathcal{T}$, let $\mathrm{lvs}(\mathcal{T})$ denote the set of leaf nodes of $\mathcal{T}$ and let $\deg(\mathcal{T})$ denote the maximum degree of the tree. For convenience we will define $l = \deg(\mathcal{T})$.

¹ A random variable $X$ is subgaussian with scale factor $\sigma^2$ if $\mathbb{E}[\exp(tX)] \le \exp(\sigma^2 t^2/2)$ for all $t$. This family encompasses both Gaussian and bounded random variables.

Algorithm 6 PearlReconstruct($\mathcal{X}$, $d$, $\gamma$)
    Initialize $T_3$ as a star tree on $x_1, x_2, x_3$
    for $i = 4 \ldots p$ do
        $T_i$ = PearlAdd($x_i$, $T_{i-1}$, $d$, $\gamma$)
    end for
    Return $T_p$


For any three nodes $x$, $y$, and $z$ in a tree, let $\mathrm{ancestor}(x, y, z)$ be the unique node that is the shared common ancestor of $x$, $y$, and $z$. This node is the unique point at which the three paths between all pairs of $x$, $y$, and $z$ intersect, and distances to this point can be computed by (where $a = \mathrm{ancestor}(x, y, z)$):

$$d_{\mathcal{T}}(x, a) = \frac{1}{2}\left(d_{\mathcal{T}}(x, y) + d_{\mathcal{T}}(x, z) - d_{\mathcal{T}}(y, z)\right) \qquad (4.1)$$

To avoid propagation of additive noise in ancestor computations, we only use distances between true leaf nodes (nodes in $\mathcal{X}$). To compute the ancestor and associated distances between three nodes $x$, $y$, $z$, some of which may not be leaves, we use a surrogate leaf node for each non-leaf in the computation. A surrogate leaf node for $x$ is one for which $x$ is on the path between that leaf and both $y$ and $z$. The restriction to minimal trees guarantees existence of surrogate leaf nodes.
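
Equation 4.1 and its two symmetric counterparts can be evaluated directly from three pairwise distances. The sketch below is a minimal illustration: the function name is ours, `d` is any symmetric distance function between leaves, and no surrogate handling is included.

```python
def ancestor_distances(d, x, y, z):
    """Distances from x, y, and z to a = ancestor(x, y, z), using Equation 4.1."""
    dxy, dxz, dyz = d(x, y), d(x, z), d(y, z)
    return (
        0.5 * (dxy + dxz - dyz),  # d(x, a)
        0.5 * (dxy + dyz - dxz),  # d(y, a)
        0.5 * (dxz + dyz - dxy),  # d(z, a)
    )
```

For a star with edge weights 1, 2, and 3 to the leaves $x$, $y$, and $z$, the pairwise distances are 3, 4, and 5, and the function returns the three edge weights.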


4.3  Algorithms 


We now describe two algorithms for multi-source network tomography and present guarantees on correctness and measurement complexity. Our first algorithm, PearlReconstruct, addresses the additive noise model, while our second, Rising, addresses the persistent noise model.


4.3.1  Additive  Noise 

The  idea  behind  our  first  algorithm  is  to  construct  the  tree  T  by  iteratively  attaching  the  leaves. 
To  add  leaf  27,  we  perform  an  intelligent  search  to  find  a  pair  of  nodes  27,27  such  that  the 
distance  between  2 7  and  ancestor(ay,  27,27)  is  minimized.  This  information,  along  with  the  fact 
that  27  is  not  in  the  same  subtree  as  either  2 7  or  27  (which  we  also  determine),  tell  us  how  to  add 
x,  to  the  tree. 

Our  search  is  intelligent  in  that  we  choose  27  and  27  to  rule  out  large  portions  of  the  tree  at  every 
step.  Specifically,  by  choosing  a  point  with  fairly  balanced  subtrees  (known  as  the  pearl  point), 
we  can  determine  which  of  these  subtrees  ay  belongs  to  and  focus  our  search  to  a  subtree  that 
is  a  fraction  of  the  original  size,  using  a  constant  number  of  measurements.  Formally,  for  any 
Algorithm 7 PearlAdd($x_i$, $T_{i-1}$, $d$, $\gamma$)

  $T_c \leftarrow T_{i-1}$
  while $|\mathrm{lvs}(T_c)| > 2$ do
    Choose a subtree $T_{out}$ such that:
      $\frac{|\mathrm{lvs}(T_c)|}{\deg(T_c)+1} < |\mathrm{lvs}(T_c) \setminus \mathrm{lvs}(T_{out})| < \frac{|\mathrm{lvs}(T_c)| \deg(T_c)}{\deg(T_c)+1}$
    $r \leftarrow$ parent of $T_{out}$ in $T_c$
    Let $T_{sub} \ne T_{out}$ be any other subtree of $T_c$ rooted at $r$ and choose $x_k \in \mathrm{lvs}(T_{sub})$, $x_j \in \mathrm{lvs}(T_{out})$.
    $y \leftarrow$ ancestor($x_i, x_j, x_k$), compute $d(x_i, y)$, $d(x_j, y)$, and $d(x_k, y)$, using surrogates.
    If $d(x_j, y) + \gamma/2 < d(x_j, r)$, then $T_c \leftarrow T_{out} \cup \{r\}$
    If $d(x_k, y) + \gamma/2 < d(x_k, r)$, then $T_c \leftarrow T_{sub} \cup \{r\}$
    Otherwise $T_c \leftarrow T_c \setminus \{T_{sub} \cup T_{out}\}$
  end while
  if $|T_c| = 1$ then
    Attach $x_i$ to $T_c$ with edge length $d(x_i, y)$.
  else
    $T_c$ has two nodes $r$ and $r'$. Choose leaves $x_k$ and $x_j$ such that $r$ is on the path between $x_k$
    and $r'$, and $r'$ is on the path between $x_j$ and $r$.
    $y \leftarrow$ ancestor($x_i, x_k, x_j$).
    If $|d(x_k, y) - d(x_k, r)| < \gamma/2$, then attach $x_i$ to $r$.
    If $|d(x_j, y) - d(x_j, r')| < \gamma/2$, then attach $x_i$ to $r'$.
    Otherwise, insert $y$ between $r$ and $r'$ (with edge weights $d(x_k, y) - d(x_k, r)$ and
    $d(x_j, y) - d(x_j, r')$) and attach $x_i$ to $y$ with edge weight $d(x_i, y)$.
  end if
  Return $T_{i-1}$ updated to include $x_i$.


directed  instance  of  a  tree  T,  the  pearl  point  is  the  internal  node  in  a  tree  for  which  the  number  of 
leaves  below  that  node  is  between  |lvs(T)|/(deg(T)  + 1)  and  |lvs(T)|deg(T)/(deg(T)  + 1).  As 
we  show,  using  the  pearl  point  results  in  a  strong  upper  bound  on  the  number  of  measurements 
used  while  ensuring  correctness  of  the  algorithm.  As  the  algorithm  carefully  chooses  which 
pairwise  distances  to  query,  our  algorithm  is  interactive. 
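
One way to locate a node satisfying the pearl-point bounds is to walk from the root toward the heaviest child until the leaf count drops below the upper threshold. The sketch below is only an illustration under assumptions not stated in the text: the tree is stored as a hypothetical `children` dictionary rooted at `root`, and $\deg(T)$ is taken to be the maximum node degree (children plus the parent edge).

```python
def leaf_counts(children, root):
    """Number of leaves below every node of a rooted tree {node: [children]}."""
    counts = {}
    stack = [(root, False)]
    while stack:
        node, done = stack.pop()
        kids = children.get(node, [])
        if not kids:
            counts[node] = 1                      # a leaf counts itself
        elif done:
            counts[node] = sum(counts[k] for k in kids)
        else:
            stack.append((node, True))            # revisit after the children
            stack.extend((k, False) for k in kids)
    return counts


def pearl_point(children, root):
    """Descend toward the heaviest child until the leaf count is at most hi."""
    counts = leaf_counts(children, root)
    n = counts[root]
    # Assumed reading of deg(T): children plus one parent edge for non-root nodes.
    deg = max(len(k) + (v != root) for v, k in children.items() if k)
    lo, hi = n / (deg + 1), n * deg / (deg + 1)
    node = root
    while counts[node] > hi and children.get(node):
        node = max(children[node], key=counts.__getitem__)
    return node, counts[node], (lo, hi)


# Caterpillar tree with 8 leaves: r0 -> (l0, r1), r1 -> (l1, r2), ...
children = {f"r{i}": [f"l{i}", f"r{i + 1}"] for i in range(6)}
children["r6"] = ["l6", "l7"]
print(pearl_point(children, "r0"))
```

The descent keeps the count above the lower threshold because the heaviest child of a node with more than `hi` leaves carries at least a `1/deg` share of them. Very small trees, where the walk would end at a leaf, are not handled here.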

PearlReconstruct  is  related  to  the  algorithm  in  [  37],  the  Sequential  Logical  Topology 
(SLT)  algorithm  [  ],  and  the  Sequoia  algorithm  [  ].  Our  search  parallels  that  of  [  ], 

but  by  using  triplet  tests  rather  than  quartet  tests  and  by  incorporating  slack  into  our  search, 
PearlReconstruct  is  robust  to  additive  noise  while  their  algorithm  is  not.  On  the  other 
hand,  the  SLT  algorithm  is  robust  to  noise,  but  they  do  not  begin  their  search  at  the  pearl  point  of 
the  tree,  and  thus  their  measurement  complexity  guarantees  only  hold  for  balanced  trees,  while 
our  guarantees  are  more  general.  The  Sequoia  algorithm  also  adopts  some  of  the  same  ideas,  but 
since  their  search  is  heuristic,  they  do  not  provide  bounds  on  the  number  of  measurements  used. 

The algorithm involves a parameter $\gamma$ that is a lower bound on the edge weights in the true tree
$T$. This parameter is critical for identifying two nodes separated by a short edge in the presence
of noise and is a robust version of the minimality condition. Similar parameters have been used
in related results [  35].

Pseudocode  for  PearlReconstruct  is  shown  in  Algorithms  6  and  7.  Our  main  correctness 
guarantee  is  the  following;  proof  of  the  result  is  deferred  to  Section  4.5. 

Theorem 4.1. Let $(X, d)$ be a noisy metric space with $|X| = p$ where $d = d_T + g$ for a tree $T$
with minimum edge length at least $\gamma$ and consider the additive noise model with scale factor at most
$\sigma^2$. Fix any $\delta \in (0, 1)$. If, for each pairwise distance queried, PearlReconstruct uses the
average over $n$ samples and

\[ n > 18 \frac{\sigma^2}{\gamma^2} \log(2p^2/\delta), \tag{4.2} \]

then with probability at least $1 - \delta$, PearlReconstruct successfully recovers $T$ and the edge
weights in the tree with at most $\gamma/2$ additive error.

This theorem is a correctness guarantee for PearlReconstruct. In the absence of noise, the
algorithm always succeeds in recovering the tree topology $T$ along with all pairwise distances in
the metric $d_T$. In the additive noise model, the algorithm fails with some probability $\delta$, but with
the remaining probability it recovers the tree topology and the edge weights in the tree with error
at most $\gamma/2$. This implies accurate recovery of all of the pairwise distances in the tree, where the
level of accuracy for any distance is linear in the number of edges between the two nodes.
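
To get a feel for the sampling requirement, the following sketch evaluates the smallest integer $n$ exceeding $18 (\sigma^2/\gamma^2) \log(2p^2/\delta)$, which is how the condition on $n$ in Theorem 4.1 is read here. The function name and the example values are illustrative assumptions, not settings from the experiments.

```python
import math


def samples_per_query(sigma2, gamma, p, delta):
    """Smallest integer n with n > 18 * (sigma2 / gamma^2) * log(2 p^2 / delta)."""
    bound = 18.0 * sigma2 / gamma**2 * math.log(2 * p**2 / delta)
    return math.floor(bound) + 1


# Unit noise scale and unit minimum edge length on a 100-leaf tree.
print(samples_per_query(sigma2=1.0, gamma=1.0, p=100, delta=0.1))
```

The dependence on $p$ is only logarithmic, so doubling the number of leaves adds a constant number of repetitions per query.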

It  remains  to  bound  the  total  number  of  measurements  used  by  the  algorithm.  The  following 
theorem  upper  bounds  this  quantity. 

Theorem 4.2. PearlReconstruct uses $O\left(p\, l\, \frac{\sigma^2}{\gamma^2} \log^2 p\right)$ pairwise measurements.

For constant-degree tree metrics, we see that the algorithm uses a slightly super-linear number of
measurements. This is a polynomial improvement over a naive algorithm that would repeatedly
query for all pairwise distances and average away noise. This naive algorithm would use $O(p^2)$
measurements, which is quadratic in the network size. By making measurements in an interactive
fashion, we obtain a significantly reduced sampling requirement.

Note that this bound also leads to a bound on the running time of the algorithm. For each node
we insert, we compute the pearl point and perform quartet tests at most $O(l \log p)$ times. Since
the pearl point can be computed in linear time, the algorithm runs in $O(p^2\, \mathrm{polylog}(p))$ time.


4.3.2  Persistent  Noise 

For  the  persistent  noise  model,  we  propose  a  divisive  algorithm;  it  recursively  partitions  the 
leaves  into  groups  corresponding  to  subtrees  of  T.  Each  partitioning  step  identifies  one  internal 
node  in  the  tree,  and  by  repeated  applications  of  our  algorithm,  we  identify  all  internal  nodes  that 
satisfy  certain  properties  (detailed  in  Theorem  4.3). 

A  top-down  partitioning  algorithm  allows  us  to  use  voting  schemes  that  are  robust  to  persistent 
noise.  Specifically,  we  identify  groups  of  nodes  by  repeatedly  performing  quartet  or  triplet  tests 
Algorithm 8 Rising($X$, $d$, $m$)

  Randomly choose $M \subset X$ with $|M| = m$
  For $x_i, x_j \in M$, compute $s(x_i, x_j) = \max_{x_k \in M} |\{x_{k'} \in M : d(x_i, x_k) - d(x_j, x_k) = d(x_i, x_{k'}) - d(x_j, x_{k'})\}|$
  Run Single Linkage Clustering using similarity function $s$ to cluster $M$ into $\mathcal{C}$ with $|\mathcal{C}| = 3$.
  for $x_i \in X \setminus M$ do
    Vote($x_i$, $\mathcal{C}$, $d$)
  end for
  Initialize $T$ with 1 node $r$
  for $C \in \mathcal{C}$ do
    $T_{sub} \leftarrow$ Split($C$, $X \setminus C$, $d$, $m$).
    Choose clusters $C_1, C_2 \in \mathcal{C} \setminus C$
    weight($r$, root($T_{sub}$)) $\leftarrow$ EdgeLength($C_1$, $C_2$, $T_{sub}$, $d$)
  end for
  Return $T$


and  deciding  on  the  structure  agreed  on  by  the  majority.  However,  to  ensure  that  these  groups 
are  sufficiently  large,  we  require  a  balancedness  condition: 

Definition 4.2 (Balance Factor). We say that $T$ has balance factor $\eta$ if $\eta$ is the smallest number
for which there exists a node $r$ such that for all internal nodes $h$ (including $r$), with subtrees
$T_1(h), \ldots, T_k(h)$ directed away from $r$:

\[ \eta \ge \frac{\max_i |\mathrm{lvs}(T_i(h))|}{\min_i |\mathrm{lvs}(T_i(h))|}. \]
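
As an illustration of the definition, the sketch below computes the largest ratio of subtree sizes for one fixed choice of the node $r$. The balance factor itself minimizes this quantity over all choices of $r$, which the sketch does not attempt. The `children` dictionary encoding and the function name are assumptions made for the example.

```python
def balance_ratio(children, root):
    """Largest max/min ratio of child-subtree leaf counts, with r = root fixed."""
    counts = {}

    def count(node):
        kids = children.get(node, [])
        counts[node] = 1 if not kids else sum(count(k) for k in kids)
        return counts[node]

    count(root)
    return max(
        max(counts[k] for k in kids) / min(counts[k] for k in kids)
        for kids in children.values()
        if kids
    )


balanced = {"r": ["u", "v"], "u": ["a", "b"], "v": ["c", "d"]}
caterpillar = {"r": ["a", "n1"], "n1": ["b", "n2"], "n2": ["c", "d"]}
print(balance_ratio(balanced, "r"), balance_ratio(caterpillar, "r"))
```

A perfectly balanced tree gives ratio 1, while a caterpillar rooted at one end gives a ratio that grows with the number of leaves.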

To identify a single internal node $r$ our algorithm randomly samples a subset of the leaves, forms
a clustering of this subset, and then places each remaining leaf into one cluster. After recursively
partitioning each cluster, we compute edge lengths using a voting scheme. In the clustering
phase, we compute a similarity function $s$ on the sampled leaves where $s(x_i, x_j)$ is large if the
two leaves belong in the same subtree of $T$, viewed with $r$ as the root. We partition the sampled
nodes into two clusters in most cases (to find the first split we partition into three). Each of these
clusters is composed of leaves from one or more subtrees rooted at $r$, but the leaves from any one
subtree are contained wholly in one cluster.
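
The similarity used in the first split (Algorithm 8) counts the largest group of sampled leaves that agree on the difference $d(x_i, x_k) - d(x_j, x_k)$. A minimal sketch, assuming distances are stored in a nested dictionary `d` and compared with exact equality (reasonable when entries are either exact or corrupted, but floating-point data would need a tolerance):

```python
from collections import Counter


def similarity(d, M, i, j):
    """Size of the largest set of x_k in M sharing one value of d(i,k) - d(j,k)."""
    diffs = Counter(d[i][k] - d[j][k] for k in M)
    return max(diffs.values())


# Unit-weight quartet tree: leaves a, b hang off one internal node, c, d off the other.
d = {
    "a": {"a": 0, "b": 2, "c": 3, "d": 3},
    "b": {"a": 2, "b": 0, "c": 3, "d": 3},
    "c": {"a": 3, "b": 3, "c": 0, "d": 2},
    "d": {"a": 3, "b": 3, "c": 2, "d": 0},
}
M = list("abcd")
print(similarity(d, M, "a", "b"), similarity(d, M, "a", "c"))
```

Leaves in the same subtree look identical from everywhere outside that subtree, so their differences agree on many $x_k$ and the similarity is larger.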

Once we have clustered the sampled nodes, we use voting to determine the group assignments
for the remaining nodes. To place a node $x_i$, we compute quartet structures (see Figure 4.1)
between $x_i$ and $x_j, x_k, x_l$ (each from different clusters) and record which node $x_i$ paired with in
the quartet test. We place $x_i$ into the cluster that most commonly paired with $x_i$.
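
The voting step can be sketched as below. The quartet test is assumed here to be the standard four-point comparison (the pairing with the smallest sum of within-pair distances wins, and three equal sums indicate a star); the function names, the distance callable `d`, and the random generator argument are all hypothetical.

```python
import random
from collections import Counter


def quartet_pair(d, x, a, b, c):
    """Return which of a, b, c pairs with x, or None for a star or ambiguous test."""
    sums = {
        a: d(x, a) + d(b, c),
        b: d(x, b) + d(a, c),
        c: d(x, c) + d(a, b),
    }
    lo = min(sums.values())
    winners = [node for node, total in sums.items() if total == lo]
    return winners[0] if len(winners) == 1 else None


def vote(d, x, clusters, rng):
    """Index of the cluster whose members most often pair with x."""
    votes = Counter()
    for _ in range(min(len(c) for c in clusters)):
        picks = [rng.choice(c) for c in clusters]   # one leaf per cluster
        winner = quartet_pair(d, x, *picks)
        if winner is not None:                      # stars cast no vote
            votes[picks.index(winner)] += 1
    return votes.most_common(1)[0][0] if votes else None


def d(u, v):
    # Toy tree metric: leaves sharing a first letter sit in the same subtree.
    if u == v:
        return 0
    return 2 if u[0] == v[0] else 4


clusters = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"]]
print(vote(d, "a9", clusters, random.Random(0)))
```

Because the decision is a majority over several independent quartets, a few corrupted distances change individual votes without changing the winner.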

The computations required to find the initial partition of leaves are slightly different from those
required for subsequent splits. To highlight these differences, we present pseudocode for
recovering the first partition in Algorithm 8 and for subsequent partitions in Algorithm 9. These
algorithms rely on two subroutines which we show in Algorithms 10 and 11.

Before  presenting  our  theoretical  guarantees,  we  remark  that  while  our  results  analyze  RISING  in 
Algorithm 9 Split($S$, $Y$, $d$, $m$)

  Randomly choose $M \subset S$ with $|M| = m$
  For each $x_k \in M$, draw $Z(k)$ randomly from $Y$.
  For $x_i, x_j \in M$, compute $s(x_i, x_j) = |\{x_k \in M : d(x_i, x_k) - d(x_j, x_k) = d(x_i, x_{Z(k)}) - d(x_j, x_{Z(k)})\}|$.
  Run Single Linkage Clustering using similarity function $s$ to cluster $M$ into $\mathcal{C}$ with $|\mathcal{C}| = 2$.
  for $x_i \in S \setminus M$ do
    Vote($x_i$, $\mathcal{C} \cup \{Y\}$, $d$)
  end for
  Initialize $T$ with 1 node $r$
  for $C \in \mathcal{C}$ do
    $T_{sub} \leftarrow$ Split($C$, $Y \cup (S \setminus C)$, $d$, $m$).
    Choose $C' \in \mathcal{C} \setminus C$
    weight($r$, root($T_{sub}$)) $\leftarrow$ EdgeLength($C'$, $Y$, $T_{sub}$, $d$)
  end for
  Return $T$


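Algorithm 10 Vote($x$, $\mathcal{C}$, $d$)

  Let $C_1, C_2, C_3 \in \mathcal{C}$
  $V_{C_1}, V_{C_2}, V_{C_3} \leftarrow 0$
  for $n \in \{1, \ldots, \min_{C \in \mathcal{C}} |C|\}$ do
    Choose $x_1 \in C_1$, $x_2 \in C_2$, $x_3 \in C_3$.
    $V_{C_i} \leftarrow V_{C_i} + 1$ if $x$ pairs with $x_i$ w.r.t. the other two.
    If $x, x_1, x_2, x_3$ form a star, ignore this vote.
  end for
  Place $x$ in $C_i$, where $V_{C_i} = \max\{V_{C_1}, V_{C_2}, V_{C_3}\}$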


the presence of only persistent noise, with slight modifications the algorithm can be made robust
to both persistent and additive noise. The main change would involve incorporating slack into the
quartet tests, much like we have done in PearlReconstruct. The analysis for this modified
algorithm would incorporate the techniques used in Theorem 4.1 (specifically concentration of
subgaussian random variables) into our current proofs. However, for clarity of presentation, our
analysis guarantees the correctness of Rising under only persistent noise.

Theorem 4.3. Let $(X, d)$ be a metric where $d = d_T + g$ for a tree $T$ with bounded balance factor
$\eta$ and where $g$ is from the persistent noise model with probability of an uncorrupted entry $\ge q$
with $q^6 > C_{\eta,l}$. Then with probability $\ge 1 - 1/p$, every execution of Rising and Split, with
parameter $m$, will correctly identify an internal node provided that:

\[ m > c_{\eta,l} \frac{\log(pm)}{(q^6 - C_{\eta,l})^2} \tag{4.3} \]

where $1/2 \le C_{\eta,l} < 1$ and $c_{\eta,l}$ are constants depending on $\eta$ and $l$.


This theorem is a correctness guarantee for the Rising algorithm, although the flavor of guarantee
is quite different from that in Theorem 4.1. This theorem ensures that any internal node for
which every subtree has size at least $m$ will be recovered by repeated calls to Algorithm 9. In
the absence of noise, we can choose $m$ to be a function of $|S|$, the subset of leaves passed into
the Split routine. However, with noise, $m$ must be $O(\log p)$ and if $S$ is too small for this, then
$S$ cannot be further resolved, and thus $\log p$ limits the recovery resolution.

Algorithm 11 EdgeLength($C_1$, $C_2$, $T_{sub}$, $d$)

  $C_L \leftarrow$ leaves in one subtree of $T_{sub}$
  $C_R \leftarrow$ leaves in another subtree of $T_{sub}$
  for $n \in \{1, \ldots, \min\{m, |C_1|, |C_2|, |C_L|, |C_R|\}\}$ do
    Draw $w \in C_1$, $x \in C_2$, $y \in C_L$, $z \in C_R$
    Record $d(w, y) + d(x, z) - d(w, x) - d(y, z)$
  end for
  Return the most frequently occurring recorded value

In Section 4.5, we give a precise characterization of $C_{\eta,l}$, which plays a critical role in Rising's
robustness to noise. While $C_{\eta,l} < 1$ for all values of $\eta$ and $l$, it grows with these quantities.
Specifically, the minimum value for $C_{\eta,l}$ is 1/2, which happens when $\eta = 1$ and $l = 2$. This
corresponds to a perfectly balanced binary tree, which is the easiest case for the persistent noise
setting.

We  now  upper  bound  the  number  of  measurements  used  by  the  algorithm: 

Theorem 4.4. On trees with bounded balance factor, Rising uses $O(p m l \log p)$ measurements
where $l$ is the maximum degree of the tree $T$.

Setting $m$ as in Theorem 4.3, we see that Rising recovers all identifiable internal nodes while
using $O(p q^{-6} \log^2(p))$ measurements. Comparing with a naive, non-interactive algorithm that
obtains all measurements, this is a polynomial improvement in sample complexity, demonstrating
the power of interactivity for this problem. We are not aware of any more sophisticated
non-interactive approaches for this setting.


4.4  Experiments 


We perform several experiments on simulated and real-world topologies to assess the validity
of our theoretical results and to demonstrate the performance of our algorithms. We study how
increasing noise affects our algorithms' ability to correctly recover the topology and also how the
number of measurements used compares to related algorithms.
[Figure 4.3 panels: (a) PearlReconstruct, (b) Rising; axis label: Probability of Uncorrupted Measurement (q)]

Figure 4.3: Noise Thresholds for PearlReconstruct and Rising.

Figure 4.4: Measurements used as a function of p for PearlReconstruct, Rising, DFS
Ordering [86], SLT [  35], and Sequoia [  ]

4.4.1  Simulations 

In  simulations,  we  demonstrate  how  our  algorithms  tolerate  noise,  how  this  tolerance  scales  with 
p,  and  how  the  number  of  measurements  used  scales  with  p.  For  these  experiments,  we  generate 
tree  topologies  and  obtain  pairwise  distances  by  computing  unweighted  path  lengths  along  the 
tree  to  represent  hop  counts  in  a  network.  We  then  perturb  this  pairwise  distance  matrix  with 
additive  or  persistent  noise  and  run  our  algorithms  on  this  perturbed  matrix.  We  assess  the 
correctness  of  our  algorithms  by  computing  the  fraction  of  quartets  for  which  the  structure  in  the 
reference  tree  matches  that  in  the  algorithm’s  output. 
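
A quartet-agreement score of this kind could be computed as in the sketch below, which is an illustration rather than the evaluation code used for the experiments. It assumes both trees are available as full pairwise distance matrices (lists of lists) and that quartet structure is read off with the four-point condition; the function names are hypothetical.

```python
from itertools import combinations


def quartet_topology(d, a, b, c, e):
    """0, 1, 2 for pairings ab|ce, ac|be, ae|bc; -1 for a star or ambiguity."""
    sums = [d[a][b] + d[c][e], d[a][c] + d[b][e], d[a][e] + d[b][c]]
    lo = min(sums)
    best = [t for t, total in enumerate(sums) if total == lo]
    return best[0] if len(best) == 1 else -1


def quartet_agreement(d_ref, d_out):
    """Fraction of leaf quartets on which the two metrics induce the same structure."""
    quartets = list(combinations(range(len(d_ref)), 4))
    same = sum(
        quartet_topology(d_ref, *q) == quartet_topology(d_out, *q) for q in quartets
    )
    return same / len(quartets)


d_ref = [[0, 2, 3, 3], [2, 0, 3, 3], [3, 3, 0, 2], [3, 3, 2, 0]]  # 01|23
d_alt = [[0, 3, 2, 3], [3, 0, 3, 2], [2, 3, 0, 3], [3, 2, 3, 0]]  # 02|13
print(quartet_agreement(d_ref, d_ref), quartet_agreement(d_ref, d_alt))
```

Enumerating all quartets costs on the order of $p^4$ operations, so for large trees one would sample quartets instead.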

For Rising, in simulations we always choose m = log2 |S| (even with noise), which as
mentioned, satisfies the conditions of Theorem 4.3 in the absence of noise. For our real world
experiments, we use $m = \log p$.

Our first experiment studies how PearlReconstruct and Rising perform in the presence of
noise. In Figures 4.3(a) and 4.3(b) we plot the fraction of incorrect quartets averaged over 20
trials for PearlReconstruct and Rising respectively, as a function of the noise for different
values of $p$. In Figure 4.3(a) we verify three properties of PearlReconstruct: (i) in the
absence of noise, it deterministically recovers the true topology as predicted by Lemma 4.5,
(ii) as the noise variance increases, PearlReconstruct becomes less accurate, (iii) on larger
topologies, PearlReconstruct requires lower noise variance. This last property follows
from Equation 4.2 since if $n$ is constant (we took $n = 1$ for these experiments), we require
$\sigma^2 = O(\gamma^2/\log p)$ in order to guarantee successful recovery, and this upper bound decreases with $p$.

For Rising, in Figure 4.3(b), we observe the opposite phenomenon; larger topologies can
tolerate more persistent noise. This matches our bounds in Theorem 4.3, which allows $q$ to approach
a constant as $m, p \to \infty$. As before, we also observe that in the absence of noise, we
deterministically recover the underlying topology, although we note that we used balanced binary trees for
these experiments. For highly unbalanced trees, we cannot make this deterministic guarantee.

To assess the measurement complexity of our algorithms, we record how many measurements
each algorithm uses as a function of $p$, in the absence of noise. These plots are shown in
Figure 4.4. As is noticeable in Figure 4.4(a), the measurement complexity for PearlReconstruct
appears to be $O(p \log p)$. We also show the measurement complexity for the DFS
Ordering algorithm of Eriksson et al. [86] and the Sequential Logical Topology (SLT) algorithm
[  35], both of which are single-source tomography methods with provable $O(p \log p)$
complexity on balanced trees. The trees used here are randomly generated, and we see that the SLT
algorithm performs worse than PearlReconstruct, while DFS Ordering seems to use a
constant multiplicative factor fewer measurements.

However, in the worst case, PearlReconstruct enjoys considerable advantage over both SLT
and DFS Ordering as can be seen in Figure 4.4(b). In this experiment, we used highly unbalanced
trees and we see that the measurement complexity of both SLT and DFS Ordering scales as $O(p^2)$,
while PearlReconstruct continues to scale as $O(p \log p)$.

In Figure 4.4(c), we compare Rising to the Sequoia algorithm of [142]. While Sequoia comes
with no guarantees about correctness or measurement complexity, it appears to use very few
measurements in practice. Rising on the other hand appears to use a multiplicative factor of
$\log p$ more measurements than Sequoia, which we confirmed empirically. However, as we show
in our real world experiments, Sequoia is less robust to noise, even when customized to use
a similar number of measurements as Rising. We also emphasize that Rising comes with
guarantees on correctness in the presence of noise while Sequoia does not.


4.4.2  Real  World  Experiments 

In addition to verifying our theoretical results, we are interested in assessing the practical
performance of our algorithms on real network tomography datasets. We use two datasets: the King
dataset [98] of pairwise latencies and a dataset of hop counts between PlanetLab [  ] hosts
measured using iPlane [129]. We selected a 500-node subset of the 1740-node King dataset. The
iPlane dataset consists of 193 end hosts.

We ran three algorithms, PearlReconstruct, Rising, and Sequoia, on both datasets and
plot the distribution of relative error values for each algorithm. Given the constructed tree metric
$(X, \hat{d})$ and the true metric $(X, d)$, we measure relative error for each pairwise distance as
$|\hat{d}(x_i, x_j) - d(x_i, x_j)| / d(x_i, x_j)$. This quantity reflects how well the tree metric approximates
the true distances in the network. These plots are shown in Figures 4.5(a) and 4.5(b). We see that on
both datasets, Rising outperforms both Sequoia and PearlReconstruct, with substantial improvements
on the King dataset. PearlReconstruct performs moderately well on both datasets.

[Figure 4.5 panels: (a) Relative Error on King, (b) Relative Error on iPlane]

Figure 4.5: CDF of relative error on King (a) and iPlane (b) datasets.

Dataset   Hosts   Total     Pearl   Rising   Sequoia
King      500     125250    8321    43608    42599
iPlane    194     18721     2480    12309    11574

Figure 4.6: Measurements used on real world data sets
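
For concreteness, a relative-error CDF of the kind plotted in Figure 4.5 could be computed as in the following sketch. It assumes relative error means $|\hat{d}(x_i, x_j) - d(x_i, x_j)| / d(x_i, x_j)$ over all unordered pairs and that both metrics are given as symmetric matrices; the function names and toy numbers are illustrative only.

```python
from itertools import combinations


def relative_errors(d_true, d_tree):
    """Sorted relative errors |d_tree - d_true| / d_true over all unordered pairs."""
    n = len(d_true)
    return sorted(
        abs(d_tree[i][j] - d_true[i][j]) / d_true[i][j]
        for i, j in combinations(range(n), 2)
    )


def empirical_cdf(errors, threshold):
    """Fraction of pairs whose relative error is at most the threshold."""
    return sum(e <= threshold for e in errors) / len(errors)


d_true = [[0, 10, 20], [10, 0, 40], [20, 40, 0]]
d_tree = [[0, 11, 20], [11, 0, 30], [20, 30, 0]]
errors = relative_errors(d_true, d_tree)
print(errors, empirical_cdf(errors, 0.1))
```

A curve that rises quickly toward 1 means most pairwise distances are approximated well by the recovered tree metric.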


Lastly, we recorded the number of measurements used by the algorithms on the two datasets in
Figure 4.6. Note that Sequoia can be used to build many trees where the recovered pairwise
distance is the median distance across all trees. To ensure a fair comparison, we build several
trees so that Sequoia and Rising use a similar number of measurements. However, even with
several trees, Rising performs better than Sequoia.


4.5  Proofs 

4.5.1  Proof  of  Theorem  4.1 

First,  we  consider  the  noiseless  scenario. 

Lemma 4.5. Let $(X, d)$ be a minimal tree metric on $T$ with $|X| = p$. Then PearlReconstruct
on input $(X, d)$ recovers $T$ and $d$ exactly.


Proof. We start with $T_3$, the tree on leaves $x_1$, $x_2$ and $x_3$. Every minimal tree on 3 leaves has the
same structure as $T_3$, so we know this is correct. Moreover, since $d(x_i, y)$ for $i \in \{1, 2, 3\}$ and
$y$ = ancestor($x_1, x_2, x_3$) is given by Equation 4.1, the edge weights in $T_3$ are also correct.
We now analyze the add procedure, showing that it correctly places $x_i$ into the tree so that $T_i$
is the correct minimal tree on $x_1, \ldots, x_i$ with the correct edge weights. We proceed by case
analysis: for any root $r$ with subtrees $T_{out}$ and $T_{sub}$, it must be that either $x_i$ belongs in $T_{out}$, $T_{sub}$
or in $T_c \setminus \{T_{sub} \cup T_{out}\}$. For any $x_k \in T_{sub}$, $x_j \in T_{out}$, if $x_i$ belongs in $T_{out}$, then it must be the
case that $d(x_j, y) < d(x_j, r)$ or else the shared common ancestor between $x_i$, $x_j$, and $x_k$ could
not possibly lie in $T_{out}$. Similarly, if $x_i$ belongs in $T_{sub}$ then it must be that $d(x_k, y) < d(x_k, r)$.
Finally, if $x_i$ lies in neither subtree, then ancestor($x_i, x_j, x_k$) $= r$.

In each case, we update $T_c$ so that it still contains the location where $x_i$ should be added. Since
we choose $T_{sub}$ and $T_{out}$ to be non-empty subtrees, the size of $T_c$ decreases on every iteration, so
the algorithm must eventually exit the loop.

When this happens, $|T_c| \le 2$ and $T_c$ contains the location of $x_i$. If $|T_c| = 1$, then the only place
to add $x_i$ is as a child of the node in $T_c$. This only happens if ancestor($x_i, x_j, x_k$) $= r$ in the last
iteration of the while loop, so the distance $d(x_i, y)$ is the correct edge weight for the new edge.


If $|T_c| = 2$, then we use two additional leaves to determine how to place $x_i$. Case analysis
reveals that our procedure correctly places $x_i$ into $T_c$. Thus, we conclude that the add procedure
correctly updates $T_{i-1}$ to contain $x_i$. By iteratively applying this argument, we arrive at the claim.


Turning to the noisy setting, we can no longer deterministically guarantee correct recovery of
$T$, but instead require a probabilistic analysis. In the algorithm, we choose three nodes $x_i$, $x_j$
and $x_k$ and compute distances between these nodes and $y$ = ancestor($x_i, x_j, x_k$). We need to be
able to correctly determine if $y$ lies between the root $r$ and $x_j$, between $r$ and $x_k$, or elsewhere
in the tree. We therefore seek to bound $|\hat{d}(x_k, y) - d(x_k, y)|$ and $|\hat{d}(x_j, y) - d(x_j, y)|$ where $\hat{d}$
corresponds to our empirical estimate of the distance based on $n$ samples.

To arrive at these bounds, we first derive concentration inequalities for the directly observed
measurements. Specifically, by application of the Subgaussian tail bound and the union bound
we have that with probability $\ge 1 - \delta$:

\[ |\hat{d}(x_i, x_j) - d(x_i, x_j)| < \sqrt{\frac{2\sigma^2 \log(2p^2/\delta)}{n}} \]

for all leaves $x_i, x_j$, $i, j \in [p]$. Using this bound along with Equation 4.1 immediately reveals
that the distance in the estimated tree between any two nodes deviates from the correct distance
by at most $\frac{3}{2}\sqrt{\frac{2\sigma^2 \log(2p^2/\delta)}{n}}$.

In order for the algorithm to work, we need to ensure that we can identify when the ancestor
node $y$ equals the root node $r$, in spite of the deviations. If:

\[ \gamma > 3\sqrt{\frac{2\sigma^2 \log(2p^2/\delta)}{n}} \tag{4.4} \]

then with high probability we will not confuse the nodes $y$ and $r$, since distances to each node
only deviate by half that. Inverting Equation 4.4 yields the bound on $n$ in the theorem.
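
For completeness, the inversion can be written out as follows, reading Equation 4.4 as $\gamma > 3\sqrt{2\sigma^2 \log(2p^2/\delta)/n}$. Both sides are positive, so squaring preserves the inequality:

```latex
\begin{align*}
\gamma > 3\sqrt{\frac{2\sigma^2 \log(2p^2/\delta)}{n}}
  &\iff \gamma^2 > \frac{18\,\sigma^2 \log(2p^2/\delta)}{n} \\
  &\iff n > 18\,\frac{\sigma^2}{\gamma^2}\,\log(2p^2/\delta),
\end{align*}
```

which matches the sample-size condition stated in Theorem 4.1.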


4.5.2  Proof  of  Theorem  4.2 


We study the add procedure. By Lemma 1 in [13 ], we know that for any $T_c$ there exists a
subtree $T_{out}$ for which:

\[ \frac{|\mathrm{lvs}(T_c)|}{\deg(T_c) + 1} < |\mathrm{lvs}(T_c) \setminus \mathrm{lvs}(T_{out})| < \frac{|\mathrm{lvs}(T_c)| \deg(T_c)}{\deg(T_c) + 1} \]


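Let $l_c = \deg(T_c)$. The fact that $|\mathrm{lvs}(T_c) \setminus \mathrm{lvs}(T_{out})| < \frac{|\mathrm{lvs}(T_c)|\, l_c}{l_c + 1}$ means that $|\mathrm{lvs}(T_{out})| > \frac{|\mathrm{lvs}(T_c)|}{l_c + 1}$.

Writing $T_c^i$ to denote $T_c$ after $i$ iterations of the loop, we see that no matter how the search
proceeds, $|\mathrm{lvs}(T_c^i)| < \frac{l_c}{l_c + 1} |\mathrm{lvs}(T_c^{i-1})|$.

Thus the number of iterations required to place $x_i$ in $T_{i-1}$ is at most $\log_{\frac{l_c+1}{l_c}}(i - 1) \le 2 l_c \log(i - 1)$.
This follows since:

\[ \log_{\frac{l_c+1}{l_c}}(i - 1) = \frac{\log(i - 1)}{\log\left(1 + \frac{1}{l_c}\right)} \le \frac{\log(i - 1)}{\frac{1}{l_c} - \frac{1}{2 l_c^2}} \le 2 l_c \log(i - 1) \]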


The first inequality is based on the Taylor expansion $\log(1 + 1/x) \ge \frac{1}{x} - \frac{1}{2x^2}$ and the second one
holds provided that $l_c \ge 1$, which is always true here. Since each loop iteration uses a constant
number of pairwise distance measurements, $l_c$ is upper bounded by $l$, the maximum degree of $T$,
and we call add at most $p$ times, we see that the measurement complexity is $O(p l \log p)$ in the
absence of noise.
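
The iteration bound is easy to sanity-check numerically. The sketch below compares the exact quantity $\log(i-1)/\log(1 + 1/l_c)$ with the bound $2 l_c \log(i-1)$ over a range of degrees; the function name is hypothetical and the check is illustrative, not part of the proof.

```python
import math


def iteration_bounds(l_c, i):
    """Exact log-base-((l_c+1)/l_c) of (i-1) and the upper bound 2 * l_c * log(i-1)."""
    exact = math.log(i - 1) / math.log(1 + 1 / l_c)
    upper = 2 * l_c * math.log(i - 1)
    return exact, upper


for l_c in (1, 2, 3, 10):
    exact, upper = iteration_bounds(l_c, i=1000)
    print(l_c, round(exact, 2), round(upper, 2))
```

The bound is loose by roughly a factor of two for large degrees, since $\log(1 + 1/l_c)$ approaches $1/l_c$.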


Finally, recall from Theorem 4.1 that if $n$ is $O(\log p)$ we can guarantee exact recovery of the tree.
We must therefore observe each measurement $O(\log p)$ times and including this multiplicative
factor results in the stated bound.


4.5.3  Proof  of  Theorem  4.3 

We  first  state  and  prove  several  lemmas,  and  then  turn  to  the  task  of  recovering  the  splits. 
Lemma 4.6 (Sampling). Let $T$ have balance factor $\eta$ and maximum degree $k$. Then in all
iterations of Rising and Split, with probability $\ge 1 - \delta_1$ the sampled subtree of $T$ with leaf
set $M$ has balance factor:

\[ \hat{\eta} \le 2\eta + 1, \]

as long as $m > 4(1 + (k - 1)\eta)^2 \log(pk)$.
Proof. In this proof we will simultaneously work with all of the recursive calls of Rising. Since
each call recovers one internal node, and there can be no more than $p$ internal nodes in $T$, we can
enumerate the calls from 1 to $p$. Each call operates on a subset of leaf nodes and we will refer to
the tree induced by those leaves as $T^s$ for the $s$th call.

For fixed $s$, define the random variables $Z^s_{ij}$, $i \in [m]$, $j \in [k]$ (see footnote 2), which take value 1 if the $i$th leaf
sampled belongs in $T^s_j$, the $j$th subtree of $r$ (the root of $T^s$). Further define $\hat{T}^s_j$ to be the sampled
version of $T^s_j$, that is the tree $T^s_j$ restricted to only the leaves in $M$. Notice that $E[Z^s_{ij}] = \frac{|\mathrm{lvs}(T^s_j)|}{|\mathrm{lvs}(T^s)|}$
and that $|\mathrm{lvs}(\hat{T}^s_j)| = \sum_{i=1}^{m} Z^s_{ij}$. By Hoeffding's inequality we have that:

\[ P\left( \left| \frac{1}{m} |\mathrm{lvs}(\hat{T}^s_j)| - \frac{|\mathrm{lvs}(T^s_j)|}{|\mathrm{lvs}(T^s)|} \right| \ge \epsilon \right) \le 2\exp\{-2m\epsilon^2\} \]


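for any single $j \in [k]$, $s \in [p]$. We would like to do this across all calls to Split, and for each
subtree in any of the calls. We take a union bound across all internal nodes and all subtrees, and then
rewrite to introduce dependence on the balance factor $\eta$, noting that $|\mathrm{lvs}(T^s_{(k)})| \le \eta\, |\mathrm{lvs}(T^s_{(1)})|$ for
any internal node (see footnote 3). This gives us that:

\[ \frac{1}{m} |\mathrm{lvs}(\hat{T}^s_{(1)})| \ge \frac{|\mathrm{lvs}(T^s_{(1)})|}{|\mathrm{lvs}(T^s)|} - \sqrt{\frac{\log(2pk/\delta_1)}{2m}} \]

\[ \frac{1}{m} |\mathrm{lvs}(\hat{T}^s_{(k)})| \le \frac{\eta\, |\mathrm{lvs}(T^s_{(1)})|}{|\mathrm{lvs}(T^s)|} + \sqrt{\frac{\log(2pk/\delta_1)}{2m}} \]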


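Note that since we have established concentration inequalities for all subtrees, the new balance
factor $\hat{\eta}$ depends only on the lower bound for the smallest subtree size and the upper bound for
the largest subtree size. Now let $m = c \log(pk)$ and set $\delta_1 = 2/(pk)$. With these settings we have:

\[ \frac{1}{m} |\mathrm{lvs}(\hat{T}^s_{(1)})| \ge \frac{|\mathrm{lvs}(T^s_{(1)})|}{|\mathrm{lvs}(T^s)|} - \sqrt{\frac{1}{c}} \]

and

\[ \frac{1}{m} |\mathrm{lvs}(\hat{T}^s_{(k)})| \le \frac{\eta\, |\mathrm{lvs}(T^s_{(1)})|}{|\mathrm{lvs}(T^s)|} + \sqrt{\frac{1}{c}} \]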

The new balance factor is the ratio of these two quantities. To find the worst case $\hat{\eta}$, we need
to maximize with respect to $|\mathrm{lvs}(T^s_{(1)})|$. It is easy to verify that the maximum is achieved at
the smallest possible size for $T^s_{(1)}$, and given a balance factor of $\eta$, we have that $|\mathrm{lvs}(T^s_{(1)})| \ge \frac{|\mathrm{lvs}(T^s)|}{1 + (k-1)\eta}$,
achieved when the remaining subtrees are all of the same size. Plugging in this value
for $|\mathrm{lvs}(T^s_{(1)})|$ we have:

\[ \hat{\eta} \le \frac{\frac{\eta}{1 + (k-1)\eta} + \sqrt{\frac{1}{c}}}{\frac{1}{1 + (k-1)\eta} - \sqrt{\frac{1}{c}}} \]


Footnote 2: we use $[m]$ to denote $\{1, \ldots, m\}$.

Footnote 3: we use $T^s_{(1)}, \ldots, T^s_{(k)}$ to denote the subtrees of $T^s$ in increasing sorted order by number of leaves.
Now as long as $c > (1 + (k - 1)\eta)^2$, this quantity is guaranteed to be positive and if $c =
4(1 + (k - 1)\eta)^2$, then some algebra shows that:

\[ \hat{\eta} \le 2\eta + 1 \]
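
The "some algebra" step can be checked numerically. The sketch below assumes the bound on the sampled balance factor has the form $\left(\frac{\eta}{1+(k-1)\eta} + \sqrt{1/c}\right) / \left(\frac{1}{1+(k-1)\eta} - \sqrt{1/c}\right)$, as used in this proof, and evaluates it at $c = 4(1 + (k-1)\eta)^2$; the function name is hypothetical.

```python
import math


def sampled_balance_bound(eta, k, c):
    """Upper bound on the sampled balance factor for sample constant c."""
    a = 1 + (k - 1) * eta
    slack = math.sqrt(1 / c)          # deviation term from the concentration bound
    return (eta / a + slack) / (1 / a - slack)


eta, k = 3, 4
c = 4 * (1 + (k - 1) * eta) ** 2      # the choice of c made in the proof
print(sampled_balance_bound(eta, k, c), 2 * eta + 1)
```

With this choice the slack equals half of $\frac{1}{1+(k-1)\eta}$, so the numerator becomes $(2\eta + 1)$ times the denominator, which is where $2\eta + 1$ comes from.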


Lemma 4.7 (Clustering). Suppose that the probability of an uncorrupted entry $q$ satisfies $q^4 > C_{\hat{\eta},k}$ and:

\[ m > c_{\hat{\eta},k} \frac{\log(m^2/\delta_2)}{(q^4 - C_{\hat{\eta},k})^2} \tag{4.5} \]

for constants $C_{\hat{\eta},k} < 1$, $c_{\hat{\eta},k}$ that depend on $\hat{\eta}$ and the max degree $k$. Then with probability
$\ge 1 - \delta_2$, Single Linkage clustering on $M$ using $s(x_i, x_j)$ as the similarity between $x_i$ and $x_j$
partitions $M$ such that either each subtree is entirely contained in one cluster $C \in \mathcal{C}$, or if a
subtree is split across clusters, those clusters contain no nodes from other subtrees.

Remark 4.1. While we have suppressed dependence on $\hat{\eta}$ in Lemma 4.7, we note that a critical
condition for correctness is that $\eta = O(1)$. This condition ensures that single linkage clustering
completely groups any individual subtree of $T$ before merging it with any other subtree and is
required for our algorithms to be robust to noise.


Proof.  The  proofs  for  Rising  and  Split  are  almost  identical.  We  tailor  our  proof  to  the  former, 
noting  where  modifications  need  to  be  made  for  the  latter. 

Our strategy is to lower bound the quantity $s(x_i, x_j)$ for any pair of leaves $x_i, x_j$ that belong to the
same subtree and to upper bound $s(x_i, x_k)$ if $x_i$ and $x_k$ do not belong to the same subtree. Under
the conditions on $q$, we show that this lower bound exceeds the upper bound and this guarantees
that one subtree will be fully contained in any cluster before any two subtrees are merged. This
means that either a subtree is fully contained in a cluster or if it is split across clusters, no nodes
from other subtrees are in these clusters.

To assist in our analysis we use the following notation. As above, we write $\hat{T}_j$ to be the $j$th subtree
of $r$, the root node in the definition of balance factor, restricted to the leaves in $M$. Let $s^*(x_i, x_j)$
be the value of $s(x_i, x_j)$ with this subsampling but in the absence of any noise in the distances.
Let $G_{ij}$ be the group of nodes $x_k$ that all have the same $d(x_i, x_k) - d(x_j, x_k)$ value and that achieve
the maximum in the computation of $s(x_i, x_j)$. In particular, this means $s^*(x_i, x_j) = |G_{ij}|$. Define
$\hat{T}_{(1)}, \ldots, \hat{T}_{(k)}$ to be the subtrees of the subsampling ordered by increasing number of leaves.

Finally  define  nTmin  =  Eti’  |lvs(f(i))|  and  /c£in  =  Etfc-i  llvs(^(d)l-  Kxmn  is  a  lower  bound  on 
s*(xi,  Xj)  for  Xi ,  Xj  in  the  same  subtree  and  K^in  is  an  upper  bound  on  m  —  s*(xi ,  xf)  for  xk 
in  different  subtrees. 

We now lower bound $s(x_i, x_j)$ for $x_i, x_j$ in the same subtree. In the presence of noise, any node
$x_l \in G_{ij}$ remains in $G_{ij}$ as long as $d(x_i, x_l)$ and $d(x_j, x_l)$ are not corrupted, which occurs with
probability at least $q^2$. Thus:

\[ E[s(x_i, x_j)] \ge q^2 s^*(x_i, x_j) \]
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Since  each  xt  contributes  to  s(xt,  x3)  independently  and  since  there  are  GV)  nodes  xt,  we  can 
use  Hoeffding’s  Inequality,  coupled  with  a  union  bound,  to  show  that  with  probability  >  1  —  8C\. 


s(xi ,  Xj)  >  q2s*(xi ,  Xj)  —  m 


l\og(m2/Scl) 
2kL 


(4.6) 


for  all  pairs  i,  j  that  belong  in  the  same  subtree.  This  is  our  lower  bound. 

For Split, we analogously define $G_{ij} = \{k : d(x_i, x_k) - d(x_j, x_k) = d(x_i, x_{z(k)}) - d(x_j, x_{z(k)})\}$
and we require that four measurements are uncorrupted. The above argument, tailored to this
scenario, gives

$$\mathbb{E}[s(x_i, x_j)] \ge q^4 s^\star(x_i, x_j)$$

and, with probability $\ge 1 - \delta_{c1}$,

$$s(x_i, x_j) \ge q^4 s^\star(x_i, x_j) - m \sqrt{\frac{\log(m^2/\delta_{c1})}{2\kappa^{T}_{\min}}}.$$


For the upper bound, we can see that a node can contribute to $s(x_i, x_k)$ if it contributes to
$s^\star(x_i, x_k)$ and it uses no corrupted measurements or if it does not contribute to $s^\star(x_i, x_k)$ and
it contains a corrupted measurement. For the first case, we will assume pessimistically that all of
the nodes $x_t \in G_{ik}$ contribute to $s(x_i, x_k)$. For the latter, we again perform a worst case analysis
where we assume any $x_t \notin G_{ik}$ for which either $d(x_i, x_t)$ or $d(x_k, x_t)$ is corrupted contributes
to $s(x_i, x_k)$. Thus any such $x_t$ contributes with probability $1 - q^2$. If we write $s_2(x_i, x_k)$ to denote
the number of nodes $x_t \notin G_{ik}$ that could contribute to $s(x_i, x_k)$, then by the same techniques as
above, we arrive at the following upper bound:

$$\mathbb{E}[s_2(x_i, x_k)] \le (1 - q^2)(m - s^\star(x_i, x_k)),$$

$$s_2(x_i, x_k) \le (1 - q^2)(m - s^\star(x_i, x_k)) + m \sqrt{\frac{\log(m^2/\delta_{c2})}{2\kappa^{\Delta}_{\min}}},$$

where the second statement holds with probability $\ge 1 - \delta_{c2}$.
In order to ensure success of our clustering algorithm, we need the lower bound for $s(x_i, x_j)$ to
be larger than the upper bound for $s(x_i, x_k)$.

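Setting $\delta_2 = \delta_{c1} = \delta_{c2}$, we can now bound $q$ as:

$$q^2 > \frac{m}{m + s^\star(x_i, x_j) - s^\star(x_i, x_k)} + \sqrt{\frac{\log(m^2/\delta_2)}{2}} \cdot \frac{s^\star(x_i, x_j)\sqrt{1/\kappa^{T}_{\min}} + (m - s^\star(x_i, x_k))\sqrt{1/\kappa^{\Delta}_{\min}}}{m + s^\star(x_i, x_j) - s^\star(x_i, x_k)}.$$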


For this inequality to hold, we require that $s^\star(x_i, x_j) > s^\star(x_i, x_k)$, but this is always the case
since $s^\star(x_i, x_j) - s^\star(x_i, x_k) \ge |\mathrm{lvs}(\hat{T}_{(1)})|$, i.e. the size of the smallest subtree.

To better illustrate the dependence between $q$ and the various parameters of the problem, we
simplify the expression using the following bounds (which are straightforward to verify):

$$\kappa^{T}_{\min} \ge \frac{m\hat{\eta}}{1 + (k-1)\hat{\eta}}, \qquad \kappa^{\Delta}_{\min} \ge \frac{2m\hat{\eta}}{1 + (k-1)\hat{\eta}}, \qquad |\mathrm{lvs}(\hat{T}_{(1)})| \ge \frac{m}{1 + (k-1)\hat{\eta}}.$$


Using these bounds we arrive at the following lower bound on $q^2$:

$$q^2 > \frac{1 + \frac{1+\sqrt{2}}{2}\sqrt{\frac{\log(m^2/\delta_2)\,(1 + (k-1)\hat{\eta})}{m\hat{\eta}}}}{1 + \frac{1}{1 + (k-1)\hat{\eta}}}.$$


Specifically, this means that the constants $c_{\hat{\eta},k}$ and $C_{\hat{\eta},k}$ in the lemma are:

$$c_{\hat{\eta},k} = \frac{1 + (k-1)\hat{\eta}}{2 + (k-1)\hat{\eta}} \qquad \text{and} \qquad C_{\hat{\eta},k} = \frac{(1+\sqrt{2})(1 + (k-1)\hat{\eta})^{3/2}}{2\sqrt{\hat{\eta}}\,(2 + (k-1)\hat{\eta})}.$$

Plugging in these constants and reorganizing the expression results in Equation 4.5. Both constants
depend on both $\hat{\eta}$ and $k$; however, notice that $c_{\hat{\eta},k} < 1$ and both $c_{\hat{\eta},k}$ and $C_{\hat{\eta},k}$ are smaller
for $\hat{\eta}$ close to 1. Thus we see that it is easier to cluster more balanced trees.

The analysis for Split is the same, except that we require $q^4$ to be greater than the right hand
side of the above lower bound on $q^2$. Since this dependence is worse than the one for Rising, we
use this expression in our result.

Lemma 4.8 (Voting). Suppose that $q^6 > c_{\hat{\eta},k}$. Then with probability $\ge 1 - \delta_3$ the voting phase
of Rising and Split correctly partitions the leaves into their subtrees as long as:


771  Cjj^k 


log  ip/h) 


for some constants $c_{\hat{\eta},k}$, $C'_{\hat{\eta},k}$ that depend on $\hat{\eta}$ and $k$.


(4.7) 


Proof. The voting procedure works by taking one node from each cluster in $\mathcal{C}$ and computing
the quartet between those three nodes and the node we are trying to place, $x_i$. Suppose that $x_i$
belongs in cluster $C^\star$; then it must be the case that $C^\star \in \mathcal{C}$ or there exists some $C' \in \mathcal{C}$ such that
$C^\star \subset C'$. This latter case can happen if we merge two subtrees in the clustering phase.

Since $\mathcal{C}$ always has cardinality 3 in Algorithm 10, when we draw one node from each of the three
clusters one of two things can happen. If we draw a node from $C^\star$ then in the absence of noise,
this quartet would correctly vote that $x_i$ belongs in the cluster $C'$. If on the other hand, we draw
a node from $C' \setminus C^\star$, then in the absence of noise this quartet would vote that $x_i$ forms a star.
Our analysis must consider both of these scenarios.

Specifically, let $Z_l$ be the indicator that the $l$th quartet test correctly voted that $x_i$ belongs in $C'$.
We perform $|\hat{T}_{(1)}|$ rounds of voting and by application of Hoeffding's Inequality and a
union bound:


for each $x_i \in X \setminus M$ and for some constant $c$ that depends only on $\hat{\eta}$ and $k$ ($c = 1 + (k-1)\hat{\eta} \ge
m/|\mathrm{lvs}(\hat{T}_{(1)})|$). We see that with probability $\ge 1 - \delta_3$, the fraction of correct votes is bounded from below
as long as $m = \omega(\sqrt{\log p})$ so that the second expression $\to 0$ as $p \to \infty$.

We will need a similar concentration bound on the number of votes that form a star. Define $W_l$
to be the indicator that the $l$th quartet test correctly forms a star. By a similar argument we see
that with probability $\ge 1 - \delta_3$:


for all $x_i \in X \setminus M$.

To guarantee that we place $x_i$ correctly, we will pessimistically assume that every vote not for
$C'$ and not for a star will vote for the same $C \in \mathcal{C}$, $C \ne C'$. Thus the fraction of votes for $C$ is
$1 - Z - W$ and we require that $Z > 1 - Z - W$. Some algebra shows that this is true if:


Inverting this equation gives us the lower bound on $m$ in the Lemma. The constant is exactly
$c_{\hat{\eta},k} = \frac{1 + (k-1)\hat{\eta}}{2 + (k-1)\hat{\eta}}$, which is the same as the constant in Lemma 4.7.


Recovering  One  Split 

Each time we call Rising or Split we attempt to recover one internal node of the tree. In terms
of dependence on $m$, we showed above that as long as $m$ is sufficiently large, the sampling phase
will result in a new balance factor $\hat{\eta}$ that is not too different from the original balance factor $\eta$
and that Single Linkage will produce clusters that reflect the subtrees. Combining the bounds on
$m$ from all three phases, we have the following lower bound on $m$:

The restrictions on the probability of an uncorrupted entry arise from the clustering and
voting phases, but the voting phase's condition is more stringent. We therefore need $q^6 > c_{\hat{\eta},k}$.

Finally, we require that the balance factor of the tree $\eta = O(1)$ so that $\hat{\eta}$ will also be a constant
for large enough $m$ with high probability.

Putting these conditions together, we can characterize the dependence on $m$ and $p$ under which
successful recovery of a single split is possible. Specifically, we have that if $m = \Omega(\log(p/\delta))$,
then with probability $\ge 1 - \delta$ (where $\delta = \delta_1 + \delta_2 + \delta_3$), we correctly recover one internal node.


Recovering  All  Splits 

There are at most $p$ internal nodes in the tree. To recover all of these with probability $1 - o(1)$, we
set each $\delta_i = O(1/p)$, and again characterize the dependence between $m$ and $p$. In the sampling
phase, we require that $m = \omega(\log p)$ to ensure that $\hat{\eta}$ does not grow with $p$. In clustering, we
similarly require $m = \omega(\log(m^2 p))$. Finally, in the voting phase, we see that $m = \omega(\log(p))$.

These  bounds  determine  conditions  for  successful  recovery  of  the  entire  tree. 


4.5.4  Proof  of  Theorem  4.4 

We will analyze each level of the tree. Since $\eta$ is bounded, there are $O(\log p)$ levels of the tree.

At each level, let $\mathcal{C}$ be the set of all groups we are trying to split at this level, that is each $C \in \mathcal{C}$
is the set of nodes passed in as the first parameter to Split, or in the case of the first call, $\mathcal{C}$ just
contains one set with all of the nodes. For each group $C \in \mathcal{C}$ let $p_C$ denote the number of nodes
in $C$ and let $m_C$ denote the value of the parameter $m$ which can be a function of $|C|$.$^4$

For each cluster $C$, we require $m_C(m_C + 1)/2$ measurements between sampled nodes and, in
Split, an additional $m_C$ measurements from the set $\mathcal{Y}$. In the voting phase, we vote on $p_C - m_C$
nodes and for each node we require $m_C + 1$ measurements to the sampled nodes and to one node
in $\mathcal{Y}$. Putting this together, we have that at any level, we use:

$$\sum_{C \in \mathcal{C}} \left[ \frac{m_C(m_C + 1)}{2} + m_C + (p_C - m_C)(m_C + 1) \right] \le \sum_{C \in \mathcal{C}} p_C(m_C + 1) \le p(m + 1),$$

as long as $m_C \ge 1$ for all $C$, and where $m = m_p$ is the value of $m$ passed into the call to
Rising, i.e. it is the largest value of $m$ across all calls to Rising and Split. Here we used that
$\sum_{C \in \mathcal{C}} p_C = p$. Therefore, regardless of the balancedness of the tree, at each level we use $O(pm)$
measurements, and as described above, there are $O(\log p)$ levels resulting in a measurement
complexity of $O(pm \log p)$. The factor of $l$ arises because each call to Split splits the subtrees
of a node into two groups; it may take up to $l$ calls to recover each internal node.
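The per-level counting argument above can be checked numerically. The following is a minimal sketch, not the thesis's implementation: the function names and the group sizes are hypothetical, and it only tallies the three terms named in the text (pairs among sampled nodes, the extra measurements used by Split, and the voting measurements) and compares the total with the bounds $\sum_C p_C(m_C+1)$ and $p(m+1)$.

```python
def level_measurements(groups, split=True):
    """Count measurements used at one level of the recursion.

    groups: list of (p_C, m_C) pairs, one per group C being split.
    split:  if True, add the extra m_C measurements that Split uses.
    """
    total = 0
    for p_c, m_c in groups:
        assert 1 <= m_c <= p_c
        sampled = m_c * (m_c + 1) // 2      # measurements between sampled nodes
        extra = m_c if split else 0         # additional measurements in Split
        voting = (p_c - m_c) * (m_c + 1)    # m_C + 1 measurements per voted node
        total += sampled + extra + voting
    return total


def level_bound(groups):
    """Upper bound sum_C p_C (m_C + 1) used in the proof."""
    return sum(p_c * (m_c + 1) for p_c, m_c in groups)


# Hypothetical level with three groups whose sizes sum to p = 100.
groups = [(40, 6), (25, 5), (35, 6)]
used = level_measurements(groups)
bound = level_bound(groups)
p = sum(p_c for p_c, _ in groups)
m = max(m_c for _, m_c in groups)
print(used, bound, p * (m + 1))
```

The gap between the exact count and the bound is $\sum_C m_C(m_C - 1)/2$, which is why the inequality needs only $m_C \ge 1$.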

$^4$ Specifically, $m = m(|C|)$ can be any increasing function of $|C|$.

Lastly, we can compute edge lengths using $O(m)$ measurements. Since this is dominated by the
above bounds, we ignore this dependence.


4.6  Conclusion 


In this chapter we studied the multi-source network tomography problem. We developed two
algorithms, with theoretical guarantees, to construct tree metrics that approximate distances
between end hosts in a network. We also demonstrated the effectiveness of these algorithms on real
world datasets.

Turning to the themes of this thesis, this chapter lends evidence to our three claims about
interactive learning. First, while we did not explicitly compare with non-interactive algorithms, we did
show that our interactive approaches have strong guarantees on both statistical performance and
measurement complexity. In contrast, naive non-interactive approaches would have significantly
higher measurement complexity to achieve the same level of statistical performance. Thus, we
see evidence for the fact that interactivity lends statistical power in unsupervised problems.

Regarding  computation,  our  two  algorithms  are  also  computationally  efficient.  As  we  saw,  both 
of  our  algorithms  have  0(/rpolylog(//)  )  computational  complexity.  Naive  algorithms  have  sig¬ 
nificantly  worse  running  time;  the  most  obvious  algorithm  would  compute  all  quartets  and  stitch 
these structures together, and therefore run in $O(p^4)$ time. While it would be desirable to have
linear time algorithms, we already see supporting evidence for the fact that interactivity brings
computational efficiency.

Lastly, in the additive noise model, we measured uniformity via the degree of the tree. We saw
that the measurement complexity of both algorithms degrades as the problems become more
uniform (higher degree), and if the tree has $\Omega(p)$ degree, then our algorithm matches a naive
non-interactive approach. This lends evidence to our claim that interactivity is powerful in the
presence of non-uniformity.

Chapter 5


Minimaxity in the Structured Normal Means Problem

The prevalence of high-dimensional signals in modern scientific investigation has inspired an
influx of research on recovering structural information from noisy data. These problems arise
across a variety of scientific and engineering disciplines; for example identifying cluster
structure in communication or social networks, multiple hypothesis testing in genomics, or anomaly
detection in vision and sensor networking. Broadly speaking, this line of work shows that
high-dimensional statistical inference can be performed at low signal-to-noise ratios provided that the
data exhibits low-dimensional structure. Specific structural assumptions include sparsity [ ],
low-rankedness [ ], cluster structure [ 18], and many others [ 6].

The literature in this direction focuses on three inference goals: detection, localization or
recovery, and estimation or denoising. Detection tasks involve deciding whether an observation
contains some meaningful information or is simply ambient noise, while recovery and estimation
tasks involve more precisely characterizing the information contained in a signal. Specifically,
in recovery problems, the goal is to identify, from a finite collection of signals, which signal
produced the observed data. The estimation or denoising problem involves leveraging structural
information to produce high-quality estimates of the signal generating the data. These
problems are closely related, but also exhibit important differences, and this chapter focuses on the
recovery problem.

One frustration among researchers is that algorithmic and analytic techniques for these
problems differ significantly for different structural assumptions. This issue was recently resolved
in the context of the estimation problem, where the atomic norm [ ] has provided a unifying
algorithmic and analytical framework, but such a theory for detection and recovery problems
remains elusive. In this chapter, we provide a unification for the recovery problem, giving us
better understanding of how signal structure affects statistical performance.

Modern measurement technology also often provides flexibility in designing strategies for data
acquisition, and this adds an element of complexity to inference tasks. As a concrete example,
crowdsourcing platforms allow for interactive data acquisition, which can be used to recover
cluster structure with lower measurement overhead [ 5, 160]. Non-interactive experimental
design-based (i.e. non-uniform) data acquisition is also enabled by modern sensing
technology, leading to two important questions: (1) How do we design sensing strategies for structure
recovery problems? (2) When should interactive acquisition be preferred to non-interactive
acquisition? We provide an answer to the first of these questions, and progress toward an answer
to the latter.

To concretely describe our main contributions, we now develop the decision-theoretic framework
of this chapter. We study the structured normal means problem defined by a finite collection
of vectors $\mathcal{V} = \{v_j\}_{j=1}^{M} \subset \mathbb{R}^d$ that index a family of probability distributions $\mathbb{P}_j = \mathcal{N}(v_j, I_d)$.
An estimator $T$ for the family $\mathcal{V}$ is a measurable function from $\mathbb{R}^d$ to $[M]$, and its maximum risk
is:

$$\mathcal{R}(T, \mathcal{V}) = \sup_{j \in [M]} \mathcal{R}_j(T, \mathcal{V}), \qquad \mathcal{R}_j(T, \mathcal{V}) = \mathbb{P}_j[T(y) \ne j],$$

where we always use $y \sim \mathbb{P}_j$ to be the observation. We are interested in the minimax risk:

$$\mathcal{R}(\mathcal{V}) = \inf_{T} \mathcal{R}(T, \mathcal{V}) = \inf_{T} \sup_{j \in [M]} \mathbb{P}_j[T(y) \ne j]. \tag{5.1}$$

We call this the isotropic setting because each Gaussian has spherical covariance. We are
specifically interested in understanding how the complexity of the family $\mathcal{V}$ influences the minimax
risk. This setting encompasses recent work on sparsity recovery [ ], biclustering [ ,118],
and many graph-based problems [ ]. An example to keep in mind is the $k$-sets problem, where
the collection $\mathcal{V}$ is formed by vectors $\mu \mathbf{1}_S$ for subsets $S \subset [d]$ of size $k$ and some signal strength
parameter $\mu$.
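To make the maximum risk concrete, the sketch below builds the $k$-sets family and estimates $\sup_j \mathbb{P}_j[T(y) \ne j]$ by Monte Carlo for a given estimator. It is illustrative only: the function names are hypothetical, the nearest-vector rule is used merely as an example estimator, and a Monte Carlo average is an estimate of the risk, not the exact probability.

```python
import itertools
import random


def k_sets_family(d, k, mu):
    """All vectors mu * 1_S for subsets S of {0, ..., d-1} with |S| = k."""
    return [
        [mu if i in subset else 0.0 for i in range(d)]
        for subset in itertools.combinations(range(d), k)
    ]


def nearest_vector(family, y):
    """Example estimator: index of the family member closest to y."""
    def sq_dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, y))
    return min(range(len(family)), key=lambda j: sq_dist(family[j]))


def max_risk(family, estimator, trials, rng):
    """Monte Carlo estimate of sup_j P_j[T(y) != j] with y ~ N(v_j, I_d)."""
    worst = 0.0
    for j, v in enumerate(family):
        errors = 0
        for _ in range(trials):
            y = [vi + rng.gauss(0.0, 1.0) for vi in v]
            if estimator(family, y) != j:
                errors += 1
        worst = max(worst, errors / trials)
    return worst


rng = random.Random(0)
strong = k_sets_family(d=5, k=2, mu=50.0)   # well-separated signals
null = k_sets_family(d=5, k=2, mu=0.0)      # all signals coincide
risk_strong = max_risk(strong, nearest_vector, trials=20, rng=rng)
risk_null = max_risk(null, nearest_vector, trials=20, rng=rng)
```

With a very large signal strength the estimated maximum risk is zero, while with $\mu = 0$ the signals are indistinguishable and any deterministic rule errs with probability one on some index, which is the regime the minimax analysis interpolates between.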

We also study the experimental design setting, where the learning algorithm can specify a
sensing strategy, defined by a vector $B \in \mathbb{R}^d_+$. Using this strategy, under $\mathbb{P}_j$, the observation is:

$$y(i) \sim v_j(i) + B(i)^{-1/2} \mathcal{N}(0, 1) = \mathcal{N}(v_j(i), B(i)^{-1}), \quad \forall i \in [d]. \tag{5.2}$$

If $B(i) = 0$, then we say that $y(i) = 0$ almost surely. We call this distribution $\mathbb{P}_{j,B}$, to denote
the dependence both on the target signal $v_j$ and the sensing strategy $B$. The total measurement
effort, or budget, used by the strategy is $\|B\|_1$, and we are typically interested in signal recovery
under some budget constraint. Specifically, the minimax risk in this setting is:

$$\mathcal{R}(\mathcal{V}, \tau) = \inf_{T, B : \|B\|_1 \le \tau} \sup_{j \in [M]} \mathbb{P}_{j,B}[T(y) \ne j]. \tag{5.3}$$
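The observation model of the experimental design setting can be simulated directly. The sketch below is an illustration under the reading that coordinate $i$ is observed with variance $1/B(i)$ and that coordinates with $B(i) = 0$ are reported as zero; the function names are hypothetical.

```python
import math
import random


def observe(v, B, rng):
    """Draw y with y(i) ~ N(v(i), 1 / B(i)); coordinates with B(i) == 0 return 0."""
    y = []
    for vi, bi in zip(v, B):
        if bi == 0:
            y.append(0.0)                       # unobserved coordinate
        else:
            y.append(vi + rng.gauss(0.0, 1.0) / math.sqrt(bi))
    return y


def budget(B):
    """Total measurement effort ||B||_1 of a sensing strategy."""
    return sum(abs(b) for b in B)


# Hypothetical strategy: spend most of the budget on the first coordinate.
B = [4.0, 0.0, 1.0]
y = observe([1.0, 2.0, 3.0], B, random.Random(0))
print(budget(B), y)
```

The isotropic setting corresponds to $B(i) = 1$ for every coordinate, so a budget of $\tau = d$ is the natural point of comparison between the two settings.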


With  this  formalization,  we  can  now  state  our  main  contributions: 

1. We give nearly matching upper and lower bounds on the minimax risk for both isotropic
and experimental design settings (Theorems 5.1 and 5.5). This result matches many special
cases that we are aware of [161], which we show through examples. Moreover, in
examples with an asymptotic flavor (defined below), this shows that the maximum likelihood
estimator (MLE) achieves the minimax rate.

2. In the isotropic case, we derive a condition on the family $\mathcal{V}$ under which the MLE exactly
achieves the minimax risk, which certifies optimality of this estimator. In this case, we also
give a heuristic algorithm that exploits connections to Bayesian inference and attempts
to improve on the MLE. This algorithm gives some insights into how to appropriately
regularize an inference problem.

3. We give sufficient conditions that certify an optimality property of an experimental design
strategy and also give an algorithm for computing such a strategy prior to data
acquisition. We give an example where a non-uniform strategy outperforms the isotropic one and
two examples (one well-known and one new) where interactive strategies provably
outperform all non-interactive ones. This latter result shows that interactive sampling can be
significantly more powerful than non-interactive experimental design.


5.1  Related  Work 


The structured normal means problem has a rich history in statistics, although the majority of
work focuses on detection or estimation in nonparametric settings, for example when the signals
belong to Besov or Sobolev spaces [109, 110]. More recently attention has turned to
combinatorial structures and the finite dimensional case. This line is motivated by statistical applications
involving complex data sources, such as tasks in graph-structured signal processing [ 56], and
the broad goal is to understand how combinatorial structures affect both statistics and
computation in these inference problems.

Focusing on detection problems, a number of papers study various combinatorial structures,
including $k$-sets [ ], cliques [ 6 1], paths [ 0], and clusters [156] in graphs, and for many of
these problems, near-optimal detection is possible. For example, Addario-Berry et al. [ ] show
that to test between the null hypothesis that every component of the vector is $\mathcal{N}(0, 1)$ and the
alternative that $k$ components have mean $\mu$, the detection threshold is $\mu \asymp \sqrt{\log(1 + p)}$. This
means that if $\mu$ grows faster than this threshold, one can achieve error probability tending to zero,
and if $\mu$ grows slower than this threshold, all procedures have error probability tending to 1. This
style of result is now available for several examples, although a unifying theory for detection
problems is still undeveloped.

Turning to recovery or localization, again several specific examples have been analyzed. The
most popular example is the biclustering problem, where $\mathcal{V}$ corresponds to $d_1 \times d_2$ matrices of the
form $\mu \mathbf{1}_{C_l} \mathbf{1}_{C_r}^T$ with $C_l \subset [d_1]$, $C_r \subset [d_2]$ [ , 118, 16 ]. However, apart from this example and a
few others [ [61], minimax bounds for the recovery problem are largely unknown. Moreover, we
are unaware of a broadly applicable analysis, like the method we develop here.

A unified treatment is possible for estimation problems, where the atomic norm framework gives
sharp phase transitions for the maximum likelihood estimator [6, 46]. The atomic norm is a
generic approach for encoding structural assumptions by decomposing the signal into a sparse
convex combination of a set of base atoms (e.g., one-sparse vectors). While this line primarily
focuses on linear inverse problems [6, 46], there are results for the estimation problem described
above [ , 136]. While we are unaware of minimax bounds for either setting, it is well known
that the mean squared error of the MLE is related to the statistical dimension of the cone
formed by the atoms. Unfortunately, atomic norm techniques rely on convex relaxation which
enables estimation but not recovery, as the minimax probability of error for any dense family is
one. Moreover, the non-convexity of our risk poses new challenges that do not arise with the
strongly-convex mean squared error objective.

While much of the literature has focused on the isotropic case, there has been recent interest
in experimental design or interactive methods, aiming to quantify the statistical improvements
enabled by interactivity. The first result in this line is a simple interactive procedure for the
$k$-sets recovery problem due to Haupt, Castro and Nowak [ 0' ]. More recently, Tanczos and
Castro [ 6 1] study more structured instantiations and show more significant statistical
improvements via interactive methods. Their work makes important progress, but it does not address the
general problem, as they hand-craft sampling algorithms for each example. A unifying,
interactive algorithm was proposed in the bandit optimization setting [ 8], but, in our setting, it is
not known to improve on non-interactive approaches. To our knowledge, a unifying interactive
algorithm and a satisfactory characterization of the advantages offered by interactive sampling
remain elusive open questions. This chapter makes progress on the latter by developing lower
bounds against all non-interactive approaches.

Lastly, there is a close connection between our setting and the channel coding problem in an
Additive White Gaussian Noise (AWGN) Channel [59, 60]. In channel coding, we are tasked
with designing a large code $\mathcal{V}$ such that if we send the codeword $v_j$, an observer, upon observing
$y \sim \mathcal{N}(v_j, I_d)$, can reliably predict the codeword sent. While the error metric is usually the
same as in our setup, typical coding-theoretic results focus on codebook design, rather than error
analysis for a particular codebook, which is our focus here. To our knowledge, the results here
do not appear in the information theory literature.


5.2  Main  Results 


In  this  section  we  develop  the  main  results  of  the  chapter.  We  start  by  bounding  the  minimax 
risk  in  the  isotropic  setting,  then  develop  a  certificate  of  optimality  for  the  maximum  likelihood 
estimator,  and  turn  to  the  algorithmic  question  of  computing  minimax  optimal  estimators.  Lastly, 
we  turn  to  the  experimental  design  setting.  We  provide  proofs  in  Section  5.5. 


5.2.1  Bounds  on  the  Minimax  Risk 

In the isotropic case, recall that we are given a finite collection $\mathcal{V}$ of vectors and an
observation $y \sim \mathcal{N}(v_j, I_d)$ for some $j \in [M]$. Given such an observation, a natural estimator is
the maximum likelihood estimator (MLE), which outputs the index $j$ of the distribution that was
most likely to have generated the observation. This estimator is defined as:

$$T_{\mathrm{MLE}}(y) = \operatorname{argmax}_{j \in [M]} \mathbb{P}_j(y) = \operatorname{argmin}_{j \in [M]} \|v_j - y\|_2^2. \tag{5.4}$$

We will analyze this estimator, which partitions $\mathbb{R}^d$ based on a Voronoi Tessellation of the set $\mathcal{V}$.
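The two forms of the MLE in Equation (5.4) can be compared directly. The following sketch, with hypothetical function names and a toy family, evaluates the Gaussian log-likelihood for each candidate and checks that maximizing it selects the same index as minimizing the squared Euclidean distance. Both loops touch every coordinate of every vector once, so the cost is proportional to $Md$.

```python
import math


def log_likelihood(v, y):
    """Log density of N(v, I_d) evaluated at y."""
    d = len(v)
    sq = sum((a - b) ** 2 for a, b in zip(v, y))
    return -0.5 * sq - 0.5 * d * math.log(2.0 * math.pi)


def mle_by_likelihood(family, y):
    """Index maximizing the likelihood of the observation."""
    return max(range(len(family)), key=lambda j: log_likelihood(family[j], y))


def mle_by_distance(family, y):
    """Index of the vector nearest to y (the Voronoi cell containing y)."""
    return min(
        range(len(family)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(family[j], y)),
    )


# Toy family in R^2 (illustrative values only).
family = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
y = [2.5, 0.4]
print(mle_by_likelihood(family, y), mle_by_distance(family, y))
```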

As  stated,  the  running  time  of  the  estimator  is  O(Md),  but  it  is  worth  pausing  to  remark  briefly 
about  computational  considerations.  In  many  examples  of  interest,  the  class  V  is  combinatorial 
in  nature,  so  M  may  be  exponentially  large,  and  efficient  implementations  of  the  MLE  may  not 
exist.  However,  as  our  setup  does  not  preclude  unstructured  problems,  the  input  to  the  estimator 
is  the  complete  collection  V,  so  the  running  time  of  the  MLE  is  linear  in  the  input  size.  If 
the  particular  problem  is  such  that  V  can  be  compactly  represented  (e.g.  it  has  combinatorial 
structure),  then  the  estimator  may  not  be  polynomial-time  computable.  This  presents  a  real 
issue,  as  researchers  have  shown  that  a  minimax-optimal  polynomial  time  estimator  is  unlikely 
to  exist  for  the  biclustering  problem  [5  ,  128],  which  we  study  in  Section  5.3.  However,  since  the 
primary  interest  of  this  work  is  statistical  in  nature,  we  will  ignore  computational  considerations 
for  most  of  our  discussion. 


We  now  turn  to  a  characterization  of  the  minimax  risk,  which  involves  analysis  of  the  MLE.  The 
following  function,  which  we  call  the  Exponentiated  Distance  Function,  plays  a  fundamental 
role. 

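Definition 5.1. For a family $\mathcal{V}$ and $\alpha > 0$, the Exponentiated Distance Function (EDF) is:

$$W(\mathcal{V}, \alpha) = \max_{j \in [M]} W_j(\mathcal{V}, \alpha), \qquad W_j(\mathcal{V}, \alpha) = \sum_{k \ne j} \exp\left(-\frac{\|v_j - v_k\|_2^2}{\alpha}\right). \tag{5.5}$$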


In the following theorem, we show that the EDF governs the performance of $T_{\mathrm{MLE}}$. More
importantly, this function also leads to a lower bound on the minimax risk, and the combination
of these two statements shows that the MLE is nearly optimal for any structured normal means
problem.

Theorem 5.1. Fix $\delta \in (0, 1)$. If $W(\mathcal{V}, 8) \le \delta$, then $\mathcal{R}(\mathcal{V}) \le \mathcal{R}(T_{\mathrm{MLE}}, \mathcal{V}) \le \delta$. On the other
hand, if $W(\mathcal{V}, 2(1 - \delta)) \ge 2^{1/(1-\delta)} - 1$, then $\mathcal{R}(\mathcal{V}) \ge \delta$.

In particular, by setting $\delta = 1/2$ above, the second statement in the theorem may be replaced by:
If $W(\mathcal{V}, 1) \ge 3$, then $\mathcal{R}(\mathcal{V}) \ge 1/2$. This setting often aids interpretability of the lower bound.
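The two conditions of Theorem 5.1 are easy to evaluate numerically once the EDF is computable. The sketch below rests on an assumption: it takes $W_j(\mathcal{V}, \alpha) = \sum_{k \ne j} \exp(-\|v_j - v_k\|_2^2 / \alpha)$, the form that a union bound combined with a Gaussian tail bound would produce. The function names and the one-sparse example family are hypothetical.

```python
import math


def edf(family, alpha):
    """Max over j of sum_{k != j} exp(-||v_j - v_k||^2 / alpha) (assumed form of W_j)."""
    worst = 0.0
    for j, vj in enumerate(family):
        w_j = 0.0
        for k, vk in enumerate(family):
            if k != j:
                sq = sum((a - b) ** 2 for a, b in zip(vj, vk))
                w_j += math.exp(-sq / alpha)
        worst = max(worst, w_j)
    return worst


def one_sparse_family(d, mu):
    """The d vectors mu * e_i, used here only as a small test family."""
    return [[mu if i == j else 0.0 for i in range(d)] for j in range(d)]


def mle_risk_at_most(family, delta):
    """Sufficient condition of Theorem 5.1: W(V, 8) <= delta."""
    return edf(family, 8.0) <= delta


def minimax_risk_at_least_half(family):
    """Lower-bound condition of Theorem 5.1 at delta = 1/2: W(V, 1) >= 3."""
    return edf(family, 1.0) >= 3.0


print(mle_risk_at_most(one_sparse_family(10, 6.0), 0.05))
print(minimax_risk_at_least_half(one_sparse_family(10, 0.5)))
```

For the one-sparse family every pair of vectors is at squared distance $2\mu^2$, so under the assumed form $W(\mathcal{V}, \alpha) = (d - 1)\exp(-2\mu^2/\alpha)$, which makes both conditions thresholds on $\mu^2$ of order $\log d$.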

Notice that the value of $\alpha$ disagrees between the lower and upper bounds, and this leads to a gap
between the necessary and sufficient conditions. This is not purely an artifact of our analysis, as
there are many examples where the MLE does not exactly achieve the minimax risk. However,
most structured normal means problems also have an asymptotic flavor, specified by a sequence
of problems $\mathcal{V}_1, \mathcal{V}_2, \ldots$, and a signal-strength parameter $\mu$, with observation $y \sim \mathcal{N}(\mu v_j, I_d)$ for
some signal $v_j$ in the current family. In this asymptotic framework, we are interested in how $\mu$
scales with the sequence to drive the minimax risk to one or zero. Almost all existing examples
in the literature are of this form [161], and in all such problems, Theorem 5.1 shows that the
MLE achieves the minimax rate. To our knowledge, such a comprehensive characterization of
recovery problems is entirely new.

Application  of  Theorem  5.1  to  instantiations  of  the  structured  normal  means  problem  requires 
bounding  the  EDF,  which  is  significantly  simpler  than  the  typical  derivation  of  this  style  of  result. 
In  particular,  proving  a  lower  bound  no  longer  requires  construction  of  a  specialized  subfamily 
of  V  as  was  the  de  facto  standard  in  this  line  of  work  [118,  161].  In  Section  5.3,  we  show  how 
simple  calculations  can  recover  existing  results. 

Turning  to  the  proof,  the  EDF  arises  naturally  as  an  upper  bound  on  the  failure  probability  of 
the  MLE  after  applying  a  union  bound  and  a  Gaussian  tail  bound.  Indeed  the  fact  that  the  EDF 
upper  bounds  the  minimax  risk  is  not  particularly  surprising.  It  is  however  more  surprising  that 
it  also  provides  a  lower  bound  on  the  minimax  risk.  We  obtain  this  bound  via  application  of 
Fano’s  Inequality,  but  we  use  a  version  that  allows  a  non-uniform  prior  and  explicitly  construct 
this  prior  using  the  EDF.  This  leads  to  our  more  general  lower  bound. 


5.2.2  Minimax-Optimal  Recovery 

Theorem 5.1 shows that the maximum likelihood estimator achieves near-optimal performance
for all structured normal means recovery problems. By near-optimal, we mean that in problems
with some asymptotic flavor, where the family of vectors grows but also becomes more separated,
the maximum likelihood estimator achieves the minimax rate. However, in many cases the MLE
is not the optimal estimator, i.e. it does not achieve the exact minimax risk. In this section,
we use deeper connections between the minimax risk and the Bayes risk to address this gap.
Specifically, we give a sufficient condition for the minimax optimality of the MLE, and we will
also design an algorithm that in other cases produces an estimator with better minimax risk.

Our approach is based on a well-known connection between the minimax risk and the Bayes risk.
For a structured normal means problem defined by a family $\mathcal{V}$, the Bayes risk for an estimator $T$
under prior $\pi \in \Delta_{M-1}$ is given by:

$$\mathcal{B}_\pi(T) = \sum_{j=1}^{M} \pi_j \, \mathcal{R}_j(T, \mathcal{V}).$$

We say that an estimator $T$ is the Bayes estimator for prior $\pi$ if it achieves the minimum Bayes
risk. A simple calculation reveals the structure of the Bayes estimator for any prior $\pi$, and this
structural characterization is essential to our development.

Proposition 5.2. For any prior $\pi$, the Bayes estimator $T_\pi$ has polyhedral acceptance regions,
that is, the estimator is of the form:

$$T_\pi(y) = j \quad \text{if } y \in A_j,$$

with $A_j = \{x : \Gamma_j x \ge b_j\}$, where $\Gamma_j \in \mathbb{R}^{M \times d}$ has $v_j - v_k$ in the $k$th row and $b_j$ has
$\frac{1}{2}(\|v_j\|_2^2 - \|v_k\|_2^2) + \log\frac{\pi_k}{\pi_j}$ in the $k$th entry. These polyhedral sets $A_j$ partition the space $\mathbb{R}^d$.
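
The structure in Proposition 5.2 can be checked numerically on a toy family. The sketch below is only an illustration under stated assumptions: the function names (`bayes_regions`, `polyhedral_estimate`, `map_estimate`) and the three-point family are hypothetical, and the matrix is written as `Gamma` following the notation above. It builds the rows $v_j - v_k$ and offsets $\frac{1}{2}(\|v_j\|_2^2 - \|v_k\|_2^2) + \log(\pi_k/\pi_j)$ and compares the resulting rule with a direct maximum a posteriori computation.

```python
import math

def bayes_regions(V, prior):
    """Build Gamma_j (rows v_j - v_k) and b_j for every hypothesis j."""
    M = len(V)
    sq = [sum(x * x for x in v) for v in V]
    Gamma = [[[a - c for a, c in zip(V[j], V[k])] for k in range(M)]
             for j in range(M)]
    b = [[0.5 * (sq[j] - sq[k]) + math.log(prior[k] / prior[j])
          for k in range(M)] for j in range(M)]
    return Gamma, b

def polyhedral_estimate(y, Gamma, b):
    """Return the first j with Gamma_j y >= b_j (entrywise)."""
    for j in range(len(Gamma)):
        if all(sum(g * t for g, t in zip(row, y)) >= bjk
               for row, bjk in zip(Gamma[j], b[j])):
            return j
    raise ValueError("no region contains y (numerical boundary case)")

def map_estimate(y, V, prior):
    """Direct MAP rule: argmax_j log(pi_j) - ||y - v_j||^2 / 2."""
    scores = [math.log(p) - 0.5 * sum((a - c) ** 2 for a, c in zip(y, v))
              for v, p in zip(V, prior)]
    return max(range(len(V)), key=scores.__getitem__)

# Hypothetical toy family: three points in the plane with a non-uniform prior.
V = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
prior = [0.5, 0.3, 0.2]
Gamma, b = bayes_regions(V, prior)
print(polyhedral_estimate((1.2, 0.0), Gamma, b))  # hypothesis 0 keeps this point
```

With the uniform prior the boundary between the first two hypotheses sits at $x_1 = 1$; placing more prior mass on the first hypothesis pushes that boundary outward, which is the mechanism the rest of this section exploits.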

We also exploit the relationship between the minimax risk and the Bayes risk. This is a well-known
result, although for completeness we provide a proof in Subsection 5.5.4. The prior $\pi$
below is known as the least-favorable prior.

Proposition 5.3. Suppose that $T$ is a Bayes estimator for some prior $\pi$. If the risk $\mathcal{R}_j(T) = \mathcal{R}_{j'}(T)$
for all $j \neq j' \in [M]$, then $T$ is a minimax optimal estimator.

Our  main  theoretical  result  leverages  this  proposition  along  with  the  structural  characterization 
of  Bayes  estimators  to  certify  minimax  optimality  of  the  MLE.  The  sufficient  condition  for  opti¬ 
mality  depends  on  a  particular  structure  of  the  family  V: 

Definition 5.2. A family $\mathcal{V}$ is unitarily invariant if there exists a set of orthogonal matrices
$\{R_i\}_{i=1}^{M}$ such that for each vector $v \in \mathcal{V}$, the set $\{R_i v\}_{i=1}^{M}$ is exactly $\mathcal{V}$.

In other words, the instance $\mathcal{V}$ can be generated by applying the orthogonal transforms to any
fixed vector in the collection. Unitarily invariant problems exhibit a high degree of symmetry, and
via Proposition 5.3, unitary invariance can be shown to be a sufficient condition for the optimality of the MLE.

Theorem 5.4. If $\mathcal{V}$ is unitarily invariant, then the MLE is minimax optimal.

Some  remarks  about  the  theorem  are  in  order: 

1.  This  theorem  reduces  the  question  of  optimality  to  a  purely  geometric  characterization  of 
the  family  V  and,  as  we  will  see,  many  well  studied  problems  are  unitarily  invariant.  One 
common family of orthogonal matrices is the set of all permutation matrices on $\mathbb{R}^d$.

2.  This  result  does  not  characterize  the  risk  of  the  MLE;  it  only  shows  that  no  other  estimator 
has  better  risk.  Specifically,  it  does  not  provide  an  analytic  bound  that  is  sharper  than 
Theorem  5.1.  From  a  practitioner’s  perspective,  an  optimality  certificate  for  an  estimator 
is more important than a bound on the risk as it helps govern practical decisions, although
risk  bounds  enable  theoretical  comparison. 

3.  Lastly,  the  result  is  not  asymptotic  in  nature  but  rather  shows  that  the  MLE  achieves  the 
exact  minimax  risk  for  a  fixed  family  V.  We  are  not  aware  of  any  other  results  in  the 
literature  that  certify  optimality  of  the  MLE  under  our  measure  of  risk. 
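
For small families, the condition in Definition 5.2 can be verified by brute force. The following sketch is an illustration rather than part of the development: the helper names are hypothetical, the orthogonal matrices are taken to be all permutation matrices (as in the first remark), and equality of sets is checked while ignoring multiplicity, which is an interpretive assumption.

```python
from itertools import combinations, permutations

def k_sets(d, k):
    """All indicator vectors of k-subsets of {0, ..., d-1}."""
    return {tuple(1 if i in S else 0 for i in range(d))
            for S in combinations(range(d), k)}

def permutation_maps(d):
    """Every coordinate permutation, represented as a function on tuples."""
    return [lambda v, p=p: tuple(v[i] for i in p)
            for p in permutations(range(d))]

def is_invariant_under(V, maps):
    """True if, for every v in V, the orbit {R(v)} equals V as a set."""
    V = set(V)
    return all({R(v) for R in maps} == V for v in V)

# The 2-subsets of a 4-element ground set are generated by permutations.
print(is_invariant_under(k_sets(4, 2), permutation_maps(4)))
# A family mixing vectors of different weight is not.
print(is_invariant_under({(1, 0, 0), (1, 1, 0)}, permutation_maps(3)))
```

The enumeration costs $d!$ maps per vector, so it is only meant for sanity checks on toy instances.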

The proof of this theorem is based on the observation that the point-wise risk $\mathcal{R}_j(\mathcal{V})$ is exactly
$1 - \mathbb{P}_j[A_j]$, where $\mathbb{P}_j$ is the gaussian measure centered at $v_j$ and $A_j$ is a particular polytope based
on the Voronoi tessellation of the point set $\mathcal{V}$. We use this characterization and the unitary
invariance of the family to show that the risk landscape for the MLE is constant across the
hypotheses $v_j$. Finally, we employ a dual characterization of the minimax risk to show that if
the  risk  landscape  is  constant,  then  the  MLE  must  be  optimal. 

In  problems  where  Theorem  5.4  can  be  applied,  we  now  have  a  complete  story  for  the  isotropic 
case.  We  know  that  the  MLE  exactly  achieves  the  minimax  risk  and  Theorem  5.1  also  gives 
satisfactory  upper  and  lower  bounds.  However,  many  problems  of  interest  do  not  have  unitarily 
invariant  structure,  and,  in  many  of  these  problems,  the  MLE  is  suboptimal. 

To  improve  on  the  MLE  in  these  settings  we  now  develop  an  algorithm  for  finding  a  better 
estimator. Our approach is to optimize over the space of priors $\pi$ in an iterative fashion, starting

Algorithm 12 PriorOpt($\mathcal{V}$)

Initialize $\pi(j) = 1/M$ for each $j \in [M]$.
For each $j \in [M]$, compute $g_j = \mathbb{P}_j[A_j]$, where $A_j = \{x \in \mathbb{R}^d : \Gamma_j x \ge b_j\}$, $\Gamma_j$ has
$v_j - v_k$ in the $k$th row, and $b_j$ has $\|v_j\|_2^2/2 - \|v_k\|_2^2/2 + \log(\pi_k/\pi_j)$ in the $k$th entry.
while the entropy of $g / \sum_{j=1}^{M} g_j$ is far from $\log M$ do
    Let $j_{\min} = \mathrm{argmin}_j\, g_j$ and $j_{\max} = \mathrm{argmax}_j\, g_j$.
    Update $\pi(j_{\min}) = \pi(j_{\min}) + \eta_t$, $\pi(j_{\max}) = \pi(j_{\max}) - \eta_t$ for some step size $\eta_t$.
    Recompute $g_j$ using the new prior.
end while


at the uniform prior $\pi_0$, whose Bayes estimator is the MLE. At each iteration, we compute the
risk functional of the current Bayes estimator, which, by Proposition 5.2, is related to gaussian
volumes of a collection of polytopes. These gaussian volumes can be approximated with Monte
Carlo sampling. We find the hypotheses $j_{\min}$ and $j_{\max}$ with the lowest and highest success probability
(that is, the highest and lowest risk) respectively, and adjust the prior by shifting mass from $j_{\max}$ to $j_{\min}$.
The aim is to move the prior so as to flatten out the risk functional. See Algorithm 12 for a more precise sketch.
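
To make the iteration concrete, here is a minimal Python sketch of the prior-shifting loop. It is not the implementation used for the experiments: the function names, the fixed iteration count (in place of the entropy-based stopping rule), the capped step size, the use of common random numbers across iterations, and the choice to return the best iterate seen are all assumptions made for illustration.

```python
import math
import random

def success_probs(V, prior, noise):
    """Monte Carlo estimate of g_j = P_j[A_j] for the Bayes rule under `prior`."""
    M = len(V)
    log_prior = [math.log(p) for p in prior]
    g = []
    for j in range(M):
        hits = 0
        for eps in noise:
            y = [a + e for a, e in zip(V[j], eps)]
            # MAP score of hypothesis k: log(pi_k) - ||y - v_k||^2 / 2
            scores = [log_prior[k]
                      - 0.5 * sum((a - c) ** 2 for a, c in zip(y, V[k]))
                      for k in range(M)]
            if max(range(M), key=scores.__getitem__) == j:
                hits += 1
        g.append(hits / len(noise))
    return g

def prior_opt(V, iters=30, step=0.02, n_samples=500, seed=0):
    """Shift prior mass toward the hypothesis with the lowest success probability."""
    rng = random.Random(seed)
    d, M = len(V[0]), len(V)
    # One fixed noise sample, reused at every iteration (common random numbers).
    noise = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_samples)]
    prior = [1.0 / M] * M
    g = success_probs(V, prior, noise)
    g_uniform = g[:]                      # success landscape of the MLE
    best_prior, best_g = prior[:], g[:]
    for _ in range(iters):
        j_min = min(range(M), key=g.__getitem__)
        j_max = max(range(M), key=g.__getitem__)
        eta = min(step, 0.5 * prior[j_max])   # keep every entry positive
        prior[j_min] += eta
        prior[j_max] -= eta
        g = success_probs(V, prior, noise)
        if min(g) > min(best_g):
            best_prior, best_g = prior[:], g[:]
    return best_prior, best_g, g_uniform

# Hypothetical example: a 3 x 3 grid of hypotheses in the plane.
V = [(float(a), float(b)) for a in (-2, 0, 2) for b in (-2, 0, 2)]
prior, g_opt, g_mle = prior_opt(V, iters=15, n_samples=200)
print(min(g_mle), min(g_opt))
```

Because the best iterate is tracked on a fixed noise sample, the reported worst-case success probability can never be lower than that of the MLE on the same sample; whether it is strictly higher depends on the family.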

The algorithm is based on zeroth-order ascent of the entropy of a particular distribution. The
distribution is the normalized risk functional, and the parameter space is the prior distribution
$\pi$ on the hypotheses. By maximizing entropy, we aim to make this distribution uniform, which
amounts to making the risk functional constant. By Proposition 5.3, this would lead to a minimax
optimal estimator. Specifically, the algorithm aims to solve the following program:

$$\mathrm{maximize}_{\pi \in \Delta_{M-1}} \; H(g_{1,\pi}, g_{2,\pi}, \ldots, g_{M,\pi}) \qquad (5.6)$$

where $g_{j,\pi} = \mathbb{P}_j[A_{j,\pi}]$, $A_{j,\pi}$ is the polytope given in Proposition 5.2 for prior $\pi$, and
$H : \mathbb{R}^M_+ \to \mathbb{R}$ is the entropy functional after normalizing the argument to lie on the simplex.

Unfortunately, it is not clear whether Program 5.6 is convex in the parameter $\pi$. The main challenge
in analyzing this program is that the point-wise risks $g_{j,\pi}$ involve the gaussian volume of
arbitrary polyhedral sets, and these are not, in general, analytically tractable quantities. Therefore
it is not clear how the parameter $\pi$ affects the objective function here, which precludes analysis
of this algorithm.

The gaussian volumes also pose a computational barrier, as even computing these volumes tends
to  be  difficult.  While  there  has  been  research  on  approximating  the  gaussian  volume  of  a  convex 
set  [58],  these  algorithms  require  that  one  initially  knows  a  small  ball  contained  entirely  in  the 
set.  In  principle,  one  could  do  this  here  by  solving  a  linear  program  to  find  a  point  in  the  interior 
of  the  polytope,  but  we  find  that  Monte  Carlo  sampling  is  significantly  more  straightforward. 
Monte  Carlo  sampling  seemed  to  work  well  for  problems  in  moderate  dimension. 

So  while  this  algorithm  is  not  known  to  have  convergence  guarantees,  iterates  do  tend  to  have 
higher  entropy  and  therefore  have  more  uniform  risk  landscapes.  This  means  that  even  though 
the  algorithm  does  not  necessarily  find  the  least  favorable  prior,  it  does  lead  to  a  prior  whose 
Bayes estimator is an improvement over the maximum likelihood estimator, except in cases

Figure 5.1: Example structured normal means problem on nine points in two dimensions. Left:
polyhedral acceptance regions of the MLE. Center: Acceptance regions of the Bayes estimator from
the optimized prior computed by Algorithm 12. Right: Success probability landscape (success
probability for each hypothesis) for the two estimators, demonstrating that the optimized estimator
has better minimax risk.


where  the  MLE  is  optimal.  For  many  problems,  it  is  therefore  worth  running  even  a  few  iterations 
of  this  algorithm  to  obtain  a  slightly  better  estimator. 

One  interpretation  of  this  algorithm  is  in  terms  of  regularization.  The  prior  computed  can  be 
viewed  as  a  regularizer  and  the  ensuing  Bayes  estimator  can  be  viewed  as  a  regularized  MLE. 
With this lens, Algorithm 12 can be thought of as computing a good regularizer for the structured
normal means problem defined by $\mathcal{V}$. Unfortunately, we have no rigorous guarantees on
Algorithm 12, although we hope this interpretation can influence future work on choosing
regularizers.

In  Figure  5.1,  we  demonstrate  this  algorithm  and  compare  against  the  MLE.  The  example  has 
nine points in two dimensions (which enables visualization) and the left panel shows the polyhedral
acceptance regions of the MLE. The central panel shows the acceptance regions of the Bayes
estimator computed by Algorithm 12 and the right panel shows the risk landscape of these two
estimators. Specifically, in the third panel, the x-axis corresponds to the nine hypotheses, and the
lines denote $\mathbb{P}_j[A_j]$, which, as we saw, is just one minus the risk for hypothesis $j$. The minimax
risk is therefore one minus the minimum value on these curves.

Notice that the risk landscape of the optimized estimator is essentially constant, which roughly
certifies that it is minimax optimal (by Proposition 5.3). More qualitatively, the minimax risk
of  this  optimized  estimator  is  significantly  better  than  that  of  the  MLE.  The  reason  for  this  is 
that,  under  the  MLE,  the  acceptance  region  for  the  central  hypothesis  is  very  small,  so  the  MLE 
has  low  acceptance  probability  for  that  hypothesis.  The  optimized  estimator  uses  an  expanded 
acceptance  region  for  this  hypothesis  which  increases  the  acceptance  probability  and  decreases 
the  minimax  risk.  Of  course,  this  comes  at  the  cost  of  decreasing  the  acceptance  probability  for 
other  hypotheses,  which  leads  to  a  flattening  of  the  risk  landscape. 

5.2.3  The  Experimental  Design  Setting


Recall the experimental design setting, where the statistician specifies a strategy $B \in \mathbb{R}^d_+$ and
receives observation $y \sim \mathbb{P}_{j,B}$ given by Equation 5.2. Our main insight is that the choice of $B$
only changes the metric structure of $\mathbb{R}^d$, and this change can be incorporated into the proof of
Theorem 5.1. Specifically, the likelihood for hypothesis $j$ under sampling strategy $B$ is:

$$\mathbb{P}_{j,B}(y) \propto \prod_{i=1}^{d} \exp\left(-B(i)\,(v_j(i) - y(i))^2/2\right)$$

and the maximum likelihood estimator is:

$$T_{\mathrm{MLE}}(y, B) = \mathrm{argmin}_{j \in [M]} \|v_j - y\|_B^2$$

where $\|v\|_B^2 = \sum_{i=1}^{d} v(i)^2 B(i)$ is the Mahalanobis norm induced by the diagonal matrix $\mathrm{diag}(B)$.

Theorem  5.1  can  be  ported  directly  to  this  setting,  leading  to  the  following: 

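Theorem 5.5. Fix $\delta \in (0, 1)$ and any sampling strategy $B$ with $\|B\|_1 \le \tau$. Define the Sampling
Exponentiated Distance Function (SEDF):

$$W(\mathcal{V}, \alpha, B) = \max_{j \in [M]} \sum_{k \neq j} \exp\left(-\frac{\|v_j - v_k\|_B^2}{\alpha}\right)$$

If $W(\mathcal{V}, 8, B) \le \delta$ then $\mathcal{R}(\mathcal{V}, \tau) \le \mathcal{R}(\mathcal{V}, T_{\mathrm{MLE}}(y, B)) \le \delta$. Conversely, if
$W(\mathcal{V}, 2(1-\delta), B) \ge 2^{\frac{1}{1-\delta}} - 1$, then $\inf_T \sup_{j \in [M]} \mathbb{P}_{j,B}[T(y) \neq j] \ge \delta$.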

The  structure  of  the  theorem  is  almost  identical  to  that  of  Theorem  5.1,  but  it  is  worth  making 
some  important  observations.  First,  the  theorem  holds  for  any  non-interactive  sampling  strategy 
$B \in \mathbb{R}^d_+$, so the upper bound is strictly more general than Theorem 5.1. Secondly, any
non-interactive strategy can be used to derive an upper bound on the minimax risk, but the same is
not true for the lower bound. Instead the lower bound provided by the theorem is dependent on
the strategy, so one must still minimize over sampling strategies to lower bound $\mathcal{R}(\mathcal{V}, \tau)$. Note
that  this  theorem  also  applies  to  the  non-isotropic  or  heteroscedastic  case  with  known,  shared 
covariance. 
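
As a hedged illustration of how the two conditions in Theorem 5.5 might be checked for a given strategy, the sketch below evaluates the SEDF directly from its definition. The function names and the toy family of scaled basis vectors are hypothetical; the constants 8 and $2^{1/(1-\delta)} - 1$ are the ones appearing in the theorem.

```python
import math

def sq_dist_B(u, v, B):
    """Squared Mahalanobis distance ||u - v||_B^2 = sum_i B(i) (u(i) - v(i))^2."""
    return sum(b * (a - c) ** 2 for a, c, b in zip(u, v, B))

def sedf(V, alpha, B):
    """W(V, alpha, B) = max_j sum_{k != j} exp(-||v_j - v_k||_B^2 / alpha)."""
    return max(
        sum(math.exp(-sq_dist_B(vj, vk, B) / alpha)
            for k, vk in enumerate(V) if k != j)
        for j, vj in enumerate(V)
    )

def mle_risk_certified(V, B, delta):
    """Upper-bound condition: W(V, 8, B) <= delta."""
    return sedf(V, 8.0, B) <= delta

def lower_bound_certified(V, B, delta):
    """Lower-bound condition for this B: W(V, 2(1-delta), B) >= 2^(1/(1-delta)) - 1."""
    return sedf(V, 2.0 * (1.0 - delta), B) >= 2.0 ** (1.0 / (1.0 - delta)) - 1.0

def scaled_basis(d, mu):
    """Hypothetical family: mu times each standard basis vector of R^d."""
    return [tuple(mu if i == j else 0.0 for i in range(d)) for j in range(d)]

B = [1.0] * 5  # uniform (isotropic) allocation
print(mle_risk_certified(scaled_basis(5, 6.0), B, delta=0.01))    # strong signal
print(lower_bound_certified(scaled_basis(5, 0.1), B, delta=0.5))  # weak signal
```

For this family every pair of hypotheses is at squared distance $2\mu^2$, so $W = (d-1)\exp(-2\mu^2/\alpha)$ and both conditions can be confirmed by hand.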


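Fortunately, the SEDF is convex in $B$, so it can be numerically minimized over the polyhedron
$\{z : 0 \le z_i \le 1, \sum_{i=1}^{d} z_i \le \tau\}$. Specifically, for any $\alpha$, we solve the convex program:

$$\mathrm{minimize}_{B \in \mathbb{R}^d_+,\, \|B\|_1 \le \tau} \; \max_{j \in [M]} \sum_{k \neq j} \exp\left(-\frac{\|v_j - v_k\|_B^2}{\alpha}\right) \qquad (5.8)$$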


to obtain the sampling strategy $B$ that minimizes the SEDF. For example, solving Program 5.8
with $\alpha = 1$ results in a strategy $B$, and if $W(\mathcal{V}, 1, B) \ge 3$, then we know that the minimax risk
$\mathcal{R}(\mathcal{V}, \tau)$ over all strategies is at least 1/2. On the other hand, solving with $\alpha = 8$ to obtain a
(different) sampling strategy $B$ and then using $B$ with the MLE would give the tightest upper bound
on the risk attainable by our proof technique. In Section 5.3, we demonstrate an example where
this optimization leads to a non-uniform sampling strategy that outperforms uniform sampling.
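
One simple way to attack Program 5.8 numerically is projected subgradient descent, sketched below. This is an assumption-laden illustration, not the solver used for the experiments: the step-size schedule, the iteration count, and the helper names are hypothetical, and the feasible set is taken to be $\{B \ge 0, \|B\|_1 = \tau\}$, using the fact that the objective is decreasing in $B$ so the budget constraint is tight at the optimum.

```python
import math

def sq_dist_B(u, v, B):
    return sum(b * (a - c) ** 2 for a, c, b in zip(u, v, B))

def sedf_terms(V, alpha, B):
    """f_j(B) = sum_{k != j} exp(-||v_j - v_k||_B^2 / alpha) for every j."""
    return [sum(math.exp(-sq_dist_B(vj, vk, B) / alpha)
                for k, vk in enumerate(V) if k != j)
            for j, vj in enumerate(V)]

def project_to_budget(z, tau):
    """Euclidean projection of z onto {B : B >= 0, sum(B) = tau}."""
    theta, running = 0.0, 0.0
    for i, ui in enumerate(sorted(z, reverse=True), start=1):
        running += ui
        t = (running - tau) / i
        if ui - t > 0:
            theta = t          # threshold from the largest valid prefix
    return [max(zi - theta, 0.0) for zi in z]

def minimize_sedf(V, alpha, tau, iters=200, step=0.5):
    """Projected subgradient descent on B -> max_j f_j(B), from uniform B."""
    d = len(V[0])
    B = [tau / d] * d
    best_B, best_val = B[:], max(sedf_terms(V, alpha, B))
    for t in range(1, iters + 1):
        f = sedf_terms(V, alpha, B)
        j = max(range(len(V)), key=f.__getitem__)   # an active hypothesis
        grad = [0.0] * d
        for k, vk in enumerate(V):
            if k == j:
                continue
            w = math.exp(-sq_dist_B(V[j], vk, B) / alpha) / alpha
            for i in range(d):
                grad[i] -= w * (vk[i] - V[j][i]) ** 2
        B = project_to_budget(
            [b - (step / math.sqrt(t)) * g for b, g in zip(B, grad)], tau)
        val = max(sedf_terms(V, alpha, B))
        if val < best_val:
            best_B, best_val = B[:], val
    return best_B, best_val

# Hypothetical instance where one coordinate separates hypotheses poorly.
V = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
B, val = minimize_sedf(V, alpha=1.0, tau=2.0)
print(B, val)
```

In this toy instance the first coordinate carries the closest pair of hypotheses, so the returned allocation places more of the budget there than the uniform strategy does.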

In the general setting, it is challenging to analytically certify that an allocation strategy $B$ minimizes
the SEDF, but in some cases it is possible. Specializing the first-order optimality conditions
for  Program  5.8  to  our  setting  gives  the  following: 

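Proposition 5.6. Let $B$ be a sampling strategy with $\|B\|_1 = \tau$. Let $S(B) \subset \mathcal{V}$ be the set of
hypotheses achieving the maximum in $W(\mathcal{V}, \alpha, B)$ and let $\pi$ be a distribution on $S(B)$. If, for all
$i, i' \in [d]$,

$$\mathbb{E}_{j \sim \pi} \sum_{k \neq j} (v_k(i) - v_j(i))^2 \exp(-\|v_k - v_j\|_B^2) = \mathbb{E}_{j \sim \pi} \sum_{k \neq j} (v_k(i') - v_j(i'))^2 \exp(-\|v_k - v_j\|_B^2),$$

then $B$ is a minimizer of $W(\mathcal{V}, \alpha, B)$ subject to $\|B\|_1 \le \tau$.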

While  application  of  this  result  could  involve  a  number  of  non-trivial  calculations,  there  are  many 
cases  where  it  does  lead  to  analytic  lower  bounds  for  particular  problems.  Specifically,  the  result 
is especially useful when $B$ is uniform across the coordinates, and $S(B) = [M]$, so that all of the
hypotheses achieve the maximum. In this case, it often suffices to choose $\pi$ to be uniform over
the  hypotheses  and  exploit  the  high  degree  of  symmetry  to  demonstrate  the  condition  holds.  As 
we  will  see  in  Section  5.3,  many  examples  studied  in  the  literature  exhibit  the  requisite  symmetry 
for  this  proposition  to  be  applied  in  a  straightforward  manner. 
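
The condition of Proposition 5.6 is easy to evaluate numerically for a candidate pair $(B, \pi)$. The sketch below is illustrative only: the function names and tolerance are hypothetical, the exponent is written without $\alpha$ exactly as in the proposition, and the code does not verify that $\pi$ is supported on $S(B)$, which the caller is assumed to ensure.

```python
import math
from itertools import combinations

def first_order_vector(V, B, pi):
    """g[i] = E_{j ~ pi} sum_{k != j} (v_k(i) - v_j(i))^2 exp(-||v_k - v_j||_B^2)."""
    d = len(V[0])
    g = [0.0] * d
    for j, (vj, pj) in enumerate(zip(V, pi)):
        for k, vk in enumerate(V):
            if k == j:
                continue
            dist = sum(b * (a - c) ** 2 for a, c, b in zip(vk, vj, B))
            w = pj * math.exp(-dist)
            for i in range(d):
                g[i] += w * (vk[i] - vj[i]) ** 2
    return g

def condition_holds(V, B, pi, tol=1e-12):
    """True if the vector above is constant across coordinates (up to tol)."""
    g = first_order_vector(V, B, pi)
    return max(g) - min(g) <= tol

# k-sets with d = 5, k = 2: uniform B and uniform pi satisfy the condition.
V = [tuple(1.0 if i in S else 0.0 for i in range(5))
     for S in combinations(range(5), 2)]
print(condition_holds(V, [1.0] * 5, [1.0 / len(V)] * len(V)))
```

For the $k$-sets family the symmetry under coordinate permutations makes every coordinate of the vector identical, which is the mechanism behind the claim that uniform sampling is optimal there.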

We  remark  that  Tanczos  and  Castro  [16]  establish  a  similar  sufficient  condition  for  the  uniform 
sampling  strategy  to  be  optimal.  Their  result  however  is  slightly  less  general  in  that  it  only 
certifies  optimality  for  the  uniform  sampling  strategy,  whereas  ours,  in  principle,  can  be  applied 
more  universally.  In  addition,  their  result  applies  only  to  problems  where  the  hypotheses  are 
of the form $\mu \mathbf{1}_S$ for a collection of subsets $S$ while ours is more general, and this generality is
important  for  some  examples  (e.g.,  the  hierarchical  clustering  example  in  Section  5.3).  The  other 
main  difference  is  that  their  approach  is  not  based  on  the  SEDF,  so  their  result  is  not  directly 
applicable  here. 


5.3  Examples 


To  demonstrate  the  scope  of  our  results,  we  present  four  instantiations  of  structured  normal 
means  problems,  and  derive  results  easily  attainable  from  our  general  approach.  These  examples 
have the asymptotic flavor described before, where we are interested in how a signal strength
parameter $\mu$ scales with a sequence of problem instances. To simplify presentation, we state the
results in terms of the minimax rate and use the notation $\mu \asymp \Delta$, where $\Delta$ is a function that
depends on the parameters of the sequence (e.g., the dimension). This notation means that if
$\mu = \omega(\Delta)$ then the minimax risk can be driven to zero, and conversely, if $\mu = o(\Delta)$, then the
minimax risk approaches one asymptotically.

The first example, the $k$-sets problem, is well studied, and as a warmup, we show how our technique
recovers existing results. The second example is the biclustering problem; this problem is
interesting because there is polynomial separation between non-interactive and interactive procedures,
and our technique can be used to establish lower bounds on all non-interactive approaches.
The third example is a graph-structured signal processing problem, and this example is interesting
because our technique generalizes existing results, but also because uniform sampling may
not be optimal. In the last example, we use Theorem 5.1 to demonstrate the achievability of the
channel capacity in the additive white gaussian noise (AWGN) channel, showing how an easy calculation
can reproduce the proof of Shannon [155]. The requisite calculations for these examples
are deferred to Section 5.5.


5.3.1  k-sets

In the $k$-sets problem, we have $M = \binom{d}{k}$ and each vector $v_j = \mathbf{1}_{S_j}$, where $S_j \subset [d]$ and $|S_j| = k$.
The observation is $y \sim \mathcal{N}(\mu v_j, I_d)$ for some hypothesis $j$.

Corollary 5.7. The minimax rate for the $k$-sets problem is $\mu \asymp \sqrt{\log(k(d-k))}$, and with budget
constraint $\tau$, it is $\mu \asymp \sqrt{\frac{d}{\tau} \log(k(d-k))}$. In the isotropic case, the MLE is minimax optimal.

This corollary follows simply by bounding the EDF for the $k$-sets problem using binomial
approximations. Using Proposition 5.6, it is easy to verify that uniform sampling is optimal here,
which immediately gives the second claim. Finally, using the set of all permutation matrices
and exploiting symmetry, we can easily verify that this class is unitarily invariant. These bounds
agree with established results in the literature [161].
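
To see where the $\log(k(d-k))$ term comes from, note that two $k$-sets that differ in $s$ elements are at squared distance $2s\mu^2$, and a fixed $k$-set has $\binom{k}{s}\binom{d-k}{s}$ such neighbors. The sketch below, with hypothetical function names, compares the resulting closed-form expression for the EDF (taken here to be $\max_j \sum_{k \neq j} \exp(-\|v_j - v_k\|_2^2/\alpha)$, as in the proof of Theorem 5.1) against brute-force enumeration.

```python
import math
from itertools import combinations

def edf_ksets_closed_form(d, k, mu, alpha):
    """Sum over s of C(k, s) C(d-k, s) exp(-2 s mu^2 / alpha)."""
    return sum(math.comb(k, s) * math.comb(d - k, s)
               * math.exp(-2.0 * s * mu ** 2 / alpha)
               for s in range(1, min(k, d - k) + 1))

def edf_bruteforce(V, alpha):
    """max_j sum_{k != j} exp(-||v_j - v_k||^2 / alpha), by enumeration."""
    return max(
        sum(math.exp(-sum((a - c) ** 2 for a, c in zip(vj, vk)) / alpha)
            for k, vk in enumerate(V) if k != j)
        for j, vj in enumerate(V)
    )

d, k, mu, alpha = 6, 2, 1.5, 8.0
V = [tuple(mu if i in S else 0.0 for i in range(d))
     for S in combinations(range(d), k)]
print(edf_ksets_closed_form(d, k, mu, alpha), edf_bruteforce(V, alpha))
```

The $s = 1$ term alone equals $k(d-k)\exp(-2\mu^2/\alpha)$, so keeping the EDF small forces $\mu^2$ to grow like $\log(k(d-k))$.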


5.3.2  Biclusters 

In the biclustering problem, we instead work over $\mathbb{R}^{d \times d}$ and let $M = \binom{d}{k}^2$. We parametrize
the class $\mathcal{V}$ with two indices so that $v_{ij} = \mathbf{1}_{S_i} \mathbf{1}_{S_j}^T$ is a $d \times d$ matrix with $k^2$ non-zeros with
$|S_i| = |S_j| = k$. The observation is $y \sim \mathcal{N}(\mu\, \mathrm{vec}(v_{ij}), I_{d^2})$ for a hypothesis $(i, j)$.

Corollary 5.8. The minimax rate for the biclustering problem is $\mu \asymp \sqrt{\frac{\log(k(d-k))}{k}}$, and with
budget constraint $\tau$, it is $\mu \asymp \sqrt{\frac{d^2}{k\tau} \log(k(d-k))}$. In the isotropic case, the MLE is minimax
optimal.

Our bounds agree with existing analyses of this class [39, 118, 161]. Obtaining this result involves
simply bounding the EDF using binomial approximations as in the $k$-sets example, and
straightforward applications of Theorem 5.4 and Proposition 5.6 with the uniform distribution.

The  biclustering  problem  is  interesting  because  a  simple  interactive  algorithm  has  significantly 
better  statistical  performance.  The  algorithm  first  samples  coordinates  of  the  matrix  randomly, 

Figure 5.2: Left: A realization of the stars problem for a graph with 13 vertices and 34 edges
with sampling budget $\tau = 34$. Edge color reflects allocation of sensing energy and vertex color
reflects success probability for the MLE under that hypothesis (warmer colors are higher for both).
Isotropic (left) has minimum success probability of 0.44 and experimental design (center) has
minimum success probability 0.56. Right: Maximum risk for isotropic and experimental design
sampling as a function of $\mu$ for the stars problem on 50- and 100-vertex graphs.

with enough energy so as to reliably test if a coordinate is active or not, until it finds an active
coordinate. It then senses on the row and column of that coordinate and identifies the rows
and columns that are active in the bicluster. Tanczos and Castro [16] show that this algorithm

succeeds  if  p  —  c log ~d,  which  is  a  factor  of  Vk  smaller  than  the  lower  bound 
established here, demonstrating concrete statistical gains from interactivity. Note that this separation
is known [161]. We provide a crude but sufficient analysis of this interactive algorithm in
Section 5.5.


5.3.3  Stars 

Let $G = (V, E)$ be a graph and let the edges be numbered $1, \ldots, d$. The class $\mathcal{V}$ is the set of all
stars in the graph, that is, the vector $v_j \in \{0, 1\}^d$ is the indicator vector of all edges emanating
from the $j$th node in the graph. Again the observation is $y \sim \mathcal{N}(\mu v_j, I_d)$ for some $j \in [|V|]$.

Corollary 5.9. In the stars problem, if the ratio between the maximum and minimum degree is
bounded by a constant, i.e. $\deg_{\max}/\deg_{\min} \le c$, then the minimax rate is $\mu \asymp$

Again this agrees with a recent result of Tanczos and Castro [16], who consider $s$-stars of the
complete graph, formed by choosing a vertex, and then activating $s$ of the edges emanating out of
that vertex. The two bounds agree in the special case of the complete graph with $s = |V| - 1$, but
otherwise  are  incomparable,  as  they  consider  different  problem  structures.  Note  that  the  degree 
requirement  here  is  not  fundamental  in  Theorem  5.1,  but  rather  a  shortcoming  of  our  calculations. 

We  highlight  this  example  because  the  uniform  allocation  strategy  does  not  necessarily  minimize 
$W(\mathcal{V}, \alpha, B)$. In Figure 5.2, we construct a graph according to the Barabási-Albert model [5] and
consider  the  class  of  stars  on  this  graph.  The  simulation  results  show  that  optimizing  the  SEDF 
to  find  a  sampling  strategy  is  never  worse  than  uniform  sampling,  and  for  low  signal  strengths 
it  can  lead  to  significantly  lower  maximum  risk.  We  believe  the  increases  in  the  right-most  plot 


iog(|y|-tfegmin) 
are caused by numerical instability arising from the non-smooth optimization problem 5.8. Note
that the risk for both uniform and non-uniform sampling approaches zero as $\mu \to \infty$, so for large
$\mu$, there is little advantage to optimizing the sampling scheme.


5.3.4  Random  Codes 

Consider a collection $\mathcal{V}$ of $M$ vectors with coordinates that are i.i.d. $\mathcal{N}(0, P)$. In expectation
over  the  draw  of  the  M  vectors,  the  Bayes  risk  of  the  maximum  likelihood  estimator  under  the 
uniform  prior  is: 

$$\mathbb{E}_{\mathcal{V}} \mathbb{E}_{j \sim \mathrm{Unif}[M]} \mathbb{P}_j[\mathrm{error}] \le (M - 1)(1 + P/2)^{-d/2}$$

In other words, if $M = o((1 + P/2)^{d/2})$, then the maximum likelihood decoder can drive the
probability of error to zero as $d \to \infty$.
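
The bound can be recovered in a few lines from the pairwise estimate used in the analysis of the MLE (the term $\exp(-\|v_j - v_k\|_2^2/8)$ from the proof of Theorem 5.1). The derivation below is a sketch under that assumption; it uses the standard identity $\mathbb{E}\exp(-t x^2) = (1 + 2t\sigma^2)^{-1/2}$ for $x \sim \mathcal{N}(0, \sigma^2)$ and $t \ge 0$.

```latex
\begin{align*}
\mathbb{E}_{\mathcal{V}}\, \mathbb{P}_j[\mathrm{error}]
  &\le \sum_{k \neq j} \mathbb{E}_{\mathcal{V}} \exp\!\left(-\tfrac{1}{8}\|v_j - v_k\|_2^2\right)
  && \text{(union bound and gaussian tail)} \\
  &= \sum_{k \neq j} \prod_{i=1}^{d} \mathbb{E} \exp\!\left(-\tfrac{1}{8} x_i^2\right),
  \quad x_i = v_j(i) - v_k(i) \sim \mathcal{N}(0, 2P)
  && \text{(independent coordinates)} \\
  &= \sum_{k \neq j} \left(1 + 2 \cdot \tfrac{1}{8} \cdot 2P\right)^{-d/2}
  = (M - 1)\,(1 + P/2)^{-d/2}.
\end{align*}
```

Averaging over $j \sim \mathrm{Unif}[M]$ leaves the right-hand side unchanged, since it does not depend on $j$.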

In information theoretic terms, this quick calculation roughly says that there exists a rate
$R = \log(M)/d = \frac{1}{2}\log(1 + P/2) - \omega(1/d)$ code with power constraint $P$ that can be reliably transmitted
over an additive white gaussian noise (AWGN) channel with noise variance 1. This nearly
matches Shannon's Channel Coding theorem [59], which says that the rate cannot exceed the
channel capacity, which in our case is $\frac{1}{2}\log(1 + P)$.

There  are  two  small  weaknesses  of  this  calculation  in  comparison  with  the  classical  achievability 
of  the  channel  capacity.  The  first  is  that  our  bound  involves  the  term  log(l  +  P/2)  instead  of 
log(l  +  P)  in  the  definition  of  channel  capacity.  We  suspect  this  is  due  to  weakness  in  our 
bounding  technique  in  Theorem  5.1,  which  in  part  allows  for  significantly  more  generality  than 
this special case. The second is that the codewords we use are drawn from $\mathcal{N}(0, P)$, so they
will exceed the power constraint $\|v\|_2^2 \le P$ with constant probability. This shortcoming can be
remedied by instead using $\mathcal{N}(0, P - c/d)$ and applying well-known $\chi^2$ deviation bounds.


5.4  Discussion 


In this chapter, we studied the structured normal means problem and gave a unified characterization
of the minimax risk both for isotropic and experimental design settings. Our work gives
insights into how to choose estimators (e.g., the optimality certificate for the MLE) and how
to design sampling strategies for structure recovery problems. Our lower bounds are critical in
demonstrating separation between non-interactive and interactive sampling, which is an important
research direction.

There  are  a  number  of  exciting  directions  for  future  work,  including  extensions  to  other  structure 
discovery  problems  such  as  detection,  and  to  other  observation  models,  such  as  compressive 
observations.  We  are  most  interested  in  developing  a  unifying  theory  for  interactive  sampling, 
analogous  to  the  theory  developed  here.  The  challenges  with  developing  such  an  understanding 
are  both  algorithmic  and  information  theoretic,  and  we  are  excited  to  tackle  these  challenges. 

5.5  Proofs


5.5.1  Proof  of  Theorem  5.1 


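Analysis of MLE: We first analyze the maximum likelihood estimator:

$$T_{\mathrm{MLE}}(y) = \mathrm{argmin}_{j \in [M]} \|v_j - y\|_2^2$$

This estimator succeeds as long as $\|v_k - y\|_2^2 > \|v_{j^*} - y\|_2^2$ for each $k \neq j^*$, when $y \sim \mathbb{P}_{j^*}$. This
condition is equivalent to:

$$\|v_k - y\|_2^2 > \|v_{j^*} - y\|_2^2 \iff \langle \epsilon, v_k - v_{j^*} \rangle < \frac{1}{2}\|v_{j^*} - v_k\|_2^2$$

where $\epsilon \sim \mathcal{N}(0, I_d)$. This follows from writing $y = v_{j^*} + \epsilon$ and then expanding the squares. So
we must simultaneously control all of these events, for fixed $j^*$:

$$\mathbb{P}_{\epsilon \sim \mathcal{N}(0, I_d)}\left[\forall k \neq j^*.\ \langle \epsilon, v_k - v_{j^*} \rangle < \|v_{j^*} - v_k\|_2^2/2\right]$$
$$= 1 - \mathbb{P}_{\epsilon \sim \mathcal{N}(0, I_d)}\left[\exists k \neq j^*.\ \langle \epsilon, v_k - v_{j^*} \rangle \ge \|v_{j^*} - v_k\|_2^2/2\right]$$
$$\ge 1 - \sum_{k \neq j^*} \mathbb{P}_{\epsilon \sim \mathcal{N}(0, I_d)}\left[\langle \epsilon, v_k - v_{j^*} \rangle \ge \|v_{j^*} - v_k\|_2^2/2\right]$$

By a gaussian tail bound, this probability is:

$$\mathbb{P}_{\epsilon \sim \mathcal{N}(0, I_d)}\left[\langle \epsilon, v_k - v_{j^*} \rangle \ge \|v_{j^*} - v_k\|_2^2/2\right] \le \exp\left(-\frac{1}{8}\|v_{j^*} - v_k\|_2^2\right)$$

So that the total failure probability is upper bounded by:

$$\mathbb{P}_{j^*}\left[T_{\mathrm{MLE}}(y) \neq j^*\right] \le \sum_{k \neq j^*} \exp\left(-\frac{1}{8}\|v_{j^*} - v_k\|_2^2\right) = W_{j^*}(\mathcal{V}, 8)$$

So if $j$ is the truth, then the probability of error is smaller than $\delta$ when $W_j(\mathcal{V}, 8) \le \delta$. For the
maximal (over hypothesis choice $j$) probability of error to be smaller than $\delta$, it suffices to have

$$W(\mathcal{V}, 8) \le \delta.$$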
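
For completeness, the constant 8 in the exponent can be traced through the tail bound as follows. This is a short worked step using the standard Chernoff bound for a centered gaussian; the shorthand $u$ and $Z$ is introduced only for this illustration.

```latex
\begin{align*}
u &= v_k - v_{j^*}, \qquad
Z = \langle \epsilon, u \rangle \sim \mathcal{N}\!\left(0, \|u\|_2^2\right)
  \quad \text{since } \epsilon \sim \mathcal{N}(0, I_d), \\
\mathbb{P}[Z \ge t] &\le \exp\!\left(-\frac{t^2}{2\|u\|_2^2}\right)
  \quad \text{for } t \ge 0, \\
\mathbb{P}\!\left[Z \ge \tfrac{1}{2}\|u\|_2^2\right]
  &\le \exp\!\left(-\frac{\left(\|u\|_2^2/2\right)^2}{2\|u\|_2^2}\right)
  = \exp\!\left(-\frac{\|u\|_2^2}{8}\right).
\end{align*}
```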


Fundamental  Limit:  We  now  turn  to  the  fundamental  limit.  We  start  with  a  version  of  Fano’s 
inequality  with  non-uniform  prior. 

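Lemma 5.10 (Non-uniform Fano Inequality). Let $\Theta = \{\theta\}$ be a parameter space that indexes a
family of probability distributions $P_\theta$ over a space $\mathcal{X}$. Fix a prior distribution $\pi$, supported on $\Theta$,
and consider $\theta \sim \pi$ and $X \sim P_\theta$. Let $f : \mathcal{X} \to \Theta$ be any possibly randomized mapping, and let
$p_e = \mathbb{P}_{\theta \sim \pi, X \sim P_\theta}[f(X) \neq \theta]$ denote the probability of error. Then:

$$p_e \ge 1 - \frac{\sum_\theta \pi(\theta) KL(P_\theta \| P_\pi) + \log 2}{H(\pi)}$$

where $P_\pi(\cdot) = \mathbb{E}_{\theta \sim \pi} P_\theta(\cdot)$ is the mixture distribution. In particular, we have:

$$\inf_f \sup_\theta \mathbb{P}_{X \sim P_\theta}[f(X) \neq \theta] \ge \inf_f \mathbb{E}_{\theta \sim \pi} \mathbb{P}_{X \sim P_\theta}[f(X) \neq \theta] \ge 1 - \frac{\sum_\theta \pi(\theta) KL(P_\theta \| P_\pi) + \log 2}{H(\pi)}$$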

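Proof. Consider the Markov chain $\theta \to X \to \hat{\theta} = f(X)$, where $\theta \sim \pi$ and $X | \theta \sim P_\theta$. Let
$E = \mathbf{1}[\hat{\theta} \neq \theta]$. Then:

$$H(E | X) + H(\theta | E, X) = H(E, \theta | X) = H(\theta | X) + H(E | \theta, X) \ge H(\theta | X)$$

Now, $H(E | \theta, X) \ge 0$ and since conditioning only reduces entropy, we have the inequality

$$H(\theta | X) \le H(p_e) + H(\theta | E, X) = H(p_e) + H(\theta | E = 0, X)\mathbb{P}[E = 0] + H(\theta | E = 1, X)\mathbb{P}[E = 1] \le H(p_e) + p_e H(\theta)$$

which proves the usual version of Fano's inequality. We want to write $H(\theta | X)$ in terms of the
KL divergence, using the mixture distribution $P_\pi$:

$$H(\theta | X) = H(\theta, X) - H(X) = \int \sum_\theta \pi(\theta) P_\theta(x) \log\left(\frac{P_\pi(x)}{\pi(\theta) P_\theta(x)}\right) dx$$
$$= \sum_\theta \pi(\theta) \int P_\theta(x) \log\left(\frac{P_\pi(x)}{P_\theta(x)}\right) dx - \sum_\theta \pi(\theta) \log \pi(\theta)$$
$$= -\sum_\theta \pi(\theta) KL(P_\theta \| P_\pi) + H(\pi)$$

Combining these gives the bound:

$$H(p_e) + p_e H(\pi) \ge H(\pi) - \sum_\theta \pi(\theta) KL(P_\theta \| P_\pi).$$

By upper bounding $H(p_e) \le \log 2$ and rearranging we prove the claim.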

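For a distribution $\pi \in \Delta_{M-1}$ over the hypotheses, let $\mathbb{P}_\pi(\cdot) = \sum_j \pi_j \mathbb{P}_j(\cdot)$ be the mixture
distribution. Then Fano's inequality (Lemma 5.10) states that the minimax probability of error is
lower bounded by:

$$\mathcal{R}(\mathcal{V}) = \inf_T \sup_j \mathbb{P}_j[T(y) \neq j] \ge \inf_T \mathbb{E}_{j \sim \pi} \mathbb{E}_{y \sim \mathbb{P}_j} \mathbf{1}[T(y) \neq j] \ge 1 - \frac{\sum_k \pi_k KL(\mathbb{P}_k \| \mathbb{P}_\pi) + \log 2}{H(\pi)}$$

Fix $\delta \in (0, 1)$ and let $j^* = \mathrm{argmax}_{j \in [M]} W_j(\mathcal{V}, 2(1 - \delta))$. We will use a prior based on this quantity:

$$\pi_k \propto \exp\left(-\frac{\|v_{j^*} - v_k\|_2^2}{2(1 - \delta)}\right)$$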

With this prior, the entropy becomes:

$$H(\pi) = \sum_k \pi_k \log\left(\frac{\sum_i \exp\left(-\frac{\|v_{j^*} - v_i\|_2^2}{2(1-\delta)}\right)}{\exp\left(-\frac{\|v_{j^*} - v_k\|_2^2}{2(1-\delta)}\right)}\right)$$
$$= \log(W(\mathcal{V}, 2(1-\delta)) + 1) + \sum_k \pi_k \frac{\|v_{j^*} - v_k\|_2^2}{2(1-\delta)}$$
$$= \log(W(\mathcal{V}, 2(1-\delta)) + 1) + \frac{1}{1-\delta} \sum_k \pi_k KL(\mathbb{P}_k \| \mathbb{P}_{j^*})$$

The 1 inside the first log comes from the fact that in the definition of $W_{j^*}$, we do not include the term
involving $j^*$ in the sum, while our prior $\pi$ does place mass proportional to 1 on hypothesis $j^*$.

The term involving the KL-divergence follows from the fact that the KL between two gaussians
is one-half the squared $\ell_2$-distance between their means.


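Looking at the lower bound from Fano's inequality, we see that if:

$$\sum_k \pi_k KL(\mathbb{P}_k \| \mathbb{P}_\pi) + \log 2 \le (1 - \delta) H(\pi) = (1 - \delta) \log(W(\mathcal{V}, 2(1 - \delta)) + 1) + \sum_k \pi_k KL(\mathbb{P}_k \| \mathbb{P}_{j^*})$$

then the probability of error is lower bounded by $\delta$. Of course it is immediate that:

$$\sum_k \pi_k KL(\mathbb{P}_k \| \mathbb{P}_{j^*}) = \sum_k \pi_k \int \mathbb{P}_k(x) \log\left(\frac{\mathbb{P}_k(x) \mathbb{P}_\pi(x)}{\mathbb{P}_{j^*}(x) \mathbb{P}_\pi(x)}\right) dx$$
$$= \sum_k \pi_k \int \mathbb{P}_k(x) \log\left(\frac{\mathbb{P}_k(x)}{\mathbb{P}_\pi(x)}\right) dx + \int \sum_k \pi_k \mathbb{P}_k(x) \log\left(\frac{\mathbb{P}_\pi(x)}{\mathbb{P}_{j^*}(x)}\right) dx$$
$$= \sum_k \pi_k KL(\mathbb{P}_k \| \mathbb{P}_\pi) + KL(\mathbb{P}_\pi \| \mathbb{P}_{j^*}) \ge \sum_k \pi_k KL(\mathbb{P}_k \| \mathbb{P}_\pi)$$

So the condition reduces to requiring that:

$$\log 2 \le (1 - \delta) \log(W(\mathcal{V}, 2(1 - \delta)) + 1).$$

After some algebra, this is equivalent to:

$$W(\mathcal{V}, 2(1 - \delta)) \ge 2^{\frac{1}{1-\delta}} - 1$$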


5.5.2  Proof  of  Theorem  5.5 

The proof of Theorem 5.5 is essentially the same as the proof of Theorem 5.1, coupled with two
observations. First, for a sampling strategy $B \in \mathbb{R}_+^d$ the maximum likelihood estimator is:

\[
T_{MLE}(y, B) = \mathrm{argmin}_{j \in [M]} \|v_j - y\|_B^2,
\]

so the analysis of the MLE depends on the Mahalanobis norm $\|\cdot\|_B$ instead of the $\ell_2$ norm.

Similarly, the KL divergence between the distributions $P_{j,B}$ and $P_{k,B}$ depends on the Mahalanobis
norm $\|\cdot\|_B$ instead of the $\ell_2$ norm. Specifically, we have:

\[
KL(P_{j,B} \| P_{k,B}) = \frac{1}{2} \|v_j - v_k\|_B^2.
\]

The lower bound proof instead uses this metric structure, but the calculations are equivalent.
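To make the role of the allocation concrete, here is a small sketch of the weighted estimator. It assumes $\|x\|_B^2 = \sum_i B(i)\, x(i)^2$, i.e. coordinate $i$ is observed with noise variance $1/B(i)$; that reading and the helper names are assumptions made for illustration.

```python
def weighted_sq_dist(u, v, B):
    """Squared B-weighted distance: sum_i B(i) * (u(i) - v(i))^2 (assumed form of ||.||_B^2)."""
    return sum(b * (a - c) ** 2 for a, c, b in zip(u, v, B))

def mle(y, hypotheses, B):
    """Index of the hypothesis closest to y in the B-weighted norm."""
    return min(range(len(hypotheses)),
               key=lambda j: weighted_sq_dist(hypotheses[j], y, B))

def kl_weighted(vj, vk, B):
    """KL divergence between the two sampled Gaussians: half the weighted squared distance."""
    return 0.5 * weighted_sq_dist(vj, vk, B)

hypotheses = [(1.0, 0.0), (0.0, 1.0)]
y = (0.4, 0.45)
print(mle(y, hypotheses, (1.0, 1.0)))  # uniform allocation
print(mle(y, hypotheses, (0.0, 1.0)))  # all energy on the second coordinate
```

The same observation can be decoded differently under different allocations, which is why both the upper and lower bounds must be restated in the $\|\cdot\|_B$ geometry.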


5.5.3  Proof  of  Proposition  5.6 

To simplify the presentation, let $f(B) = W(\mathcal{V}, \alpha, B)$. $f(B)$ is convex and (strictly) monotonically
decreasing, so we know that the minimum will be achieved when the constraint is tight, i.e.
when $\|B\|_1 = \tau$. The Lagrangian is:

\[
\mathcal{L}(B, \lambda) = f(B) + \lambda(\|B\|_1 - \tau)
\]

and the minimum is achieved at $\bar{B}$, with $\|\bar{B}\|_1 = \tau$, if there is a value $\lambda$ such that $0 \in \partial\mathcal{L}(\bar{B}, \lambda)$.
Observing that the subgradient is $\partial f(\bar{B}) + \lambda\mathbf{1}$, it suffices to ignore the Lagrangian term and
instead ensure that $\partial f(\bar{B}) \propto \mathbf{1}$. $f(B)$ is a maximum of $M$ convex functions, where $f_j(B)$ is
the function corresponding to hypothesis $v_j$, and, by direct calculation, the subgradient of this
function $f_j(B)$ is:

\[
\frac{\partial f_j(B)}{\partial B(i)} = -\frac{1}{\alpha} \sum_{k \ne j} (v_k(i) - v_j(i))^2 \exp(-\|v_k - v_j\|_B^2/\alpha).
\]

Moreover, the subgradient of the maximum of a set of functions is the convex hull of the subgradients
of all functions achieving the maximum. This means that if there exists a distribution $\pi$,
supported over the maximizers of $f(\bar{B})$, such that the expectation of the subgradients is constant,
we have certified optimality of $\bar{B}$. This is precisely the condition in the Proposition.


5.5.4  Proof  of  Theorem  5.4 

Proof of Proposition 5.2: To prove Proposition 5.2, we make two claims. First we certify that
for a prior $\pi$, the Maximum a Posteriori (MAP) estimator is a Bayes estimator for prior $\pi$. Given
$\pi$, the MAP estimator is:

\[
T_\pi(y) = \mathrm{argmax}_j\, \pi(j) \exp\{-\|v_j - y\|_2^2/2\}.
\]

Define the posterior risk of an estimator $T$ to be the expectation of the loss, under the posterior
distribution on the hypothesis. In our case this is:

\[
r(T | y) = \sum_{j=1}^M \mathbf{1}[T(y) \ne j]\, \pi(j | y) \quad \text{where} \quad \pi(j | y) \propto \pi(j) \exp\{-\|v_j - y\|_2^2/2\}.
\]

For a fixed $y$, this quantity is minimized by letting $T(y)$ be the maximizer of the posterior, as this
makes the 0-1 loss term zero for the largest $\pi(j | y)$ value. Thus for each $y$ we minimize the
posterior risk by letting $T(y)$ be the MAP estimate. The result follows by the well known fact
that if an estimator minimizes the posterior risk at each point, then it is the Bayes estimator.

This  argument  shows  that  the  only  types  of  estimators  we  need  to  analyze  are  MAP  estimators 
under  various  priors.  This  gives  us  the  requisite  structure  to  prove  Proposition  5.2. 

Specifically, for a prior $\pi$, for the MAP estimate to predict hypothesis $j$, it must be the case that:

\[
\forall k \ne j: \quad \pi_j \exp\{-\|v_j - y\|_2^2/2\} \ge \pi_k \exp\{-\|v_k - y\|_2^2/2\}.
\]

This can be simplified to:

\[
\langle v_j - v_k, y \rangle \ge \frac{1}{2}\left(\|v_j\|_2^2 - \|v_k\|_2^2\right) + \log\frac{\pi_k}{\pi_j}.
\]

Thus the acceptance region for the hypothesis $j$ is the set of all points $y$ that satisfy all of these
$M - 1$ inequalities. This is exactly the polyhedral set $A_j$.
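The equivalence between the MAP rule and the polyhedral acceptance regions can be sanity-checked on a toy family. The sketch below uses hypothetical helper names and an arbitrary three-hypothesis example; it evaluates the MAP rule directly and separately tests the $M - 1$ linear inequalities.

```python
import math

def map_estimate(y, hypotheses, prior):
    """argmax_j of log pi_j - ||v_j - y||^2 / 2."""
    def score(j):
        sq = sum((a - b) ** 2 for a, b in zip(hypotheses[j], y))
        return math.log(prior[j]) - 0.5 * sq
    return max(range(len(hypotheses)), key=score)

def in_acceptance_region(y, j, hypotheses, prior):
    """Check <v_j - v_k, y> >= (||v_j||^2 - ||v_k||^2) / 2 + log(pi_k / pi_j) for all k != j."""
    vj = hypotheses[j]
    for k, vk in enumerate(hypotheses):
        if k == j:
            continue
        lhs = sum((a - b) * c for a, b, c in zip(vj, vk, y))
        rhs = 0.5 * (sum(a * a for a in vj) - sum(b * b for b in vk))
        rhs += math.log(prior[k] / prior[j])
        if lhs < rhs:
            return False
    return True

hypotheses = [(0.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
prior = [0.5, 0.3, 0.2]
y = (1.13, 0.07)
j = map_estimate(y, hypotheses, prior)
print(j, in_acceptance_region(y, j, hypotheses, prior))
```

Away from ties, every observation lies in exactly one acceptance region, and that region belongs to the MAP decision.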

Proof of Proposition 5.3: We provide a proof of this well-known result showing that the Bayes
estimator with uniform risk landscape is minimax optimal. Let $T_\pi$ be the Bayes estimator under
prior $\pi$ and let $T_0$ be some other estimator. Since $T_\pi$ has constant risk landscape, we know that
$\max_j \mathcal{R}_j(\mathcal{V}, T_\pi) = B_\pi(T_\pi)$, or the minimax risk for $T_\pi$ is equal to its Bayes risk. We know that
the Bayes risk of $T_0$ is at most the minimax risk for $T_0$, i.e. $B_\pi(T_0) \le \max_j \mathcal{R}_j(\mathcal{V}, T_0)$. If it were
the case that $T_0$ had strictly lower minimax risk, then we would have:

\[
B_\pi(T_0) \le \max_j \mathcal{R}_j(\mathcal{V}, T_0) < \max_j \mathcal{R}_j(\mathcal{V}, T_\pi) = B_\pi(T_\pi).
\]

However, this is a contradiction since $T_\pi$ is the Bayes estimator under prior $\pi$, meaning that it
minimizes the Bayes risk.

Proof of Theorem 5.4: Our goal is to apply Proposition 5.3. By the fact that $\mathcal{R}_j(\mathcal{V}, T) = 1 - P_j[A_j]$
where $A_j$ is $T$'s acceptance region for hypothesis $j$, we must show that the $P_j$ probability
content of the acceptance regions is constant. Ignoring the normalization factor of the Gaussian
density, this is:

\[
\int_{A_j} \exp\{-\|v_j - x\|_2^2/2\}\, dx,
\]

where $A_j = \{z \mid \Gamma_j z \ge b_j\}$ as defined in Proposition 5.2. We will exploit the unitary invariance
of the family.

For any pair of hypotheses $j, k$, let $R_{jk}$ be the orthogonal matrix such that $v_k = R_{jk} v_j$ and note
that $R_{kj}$, the orthogonal matrix that maps $v_k$ to $v_j$, is just $R_{jk}^T$. This also means that $R_{jk}^T R_{jk} =
R_{jk} R_{kj} = I$. Via a change of variables $x = R_{kj} y$, the integrand becomes:

\[
\exp\{-\|v_j - R_{kj} y\|_2^2/2\} = \exp\{-\|R_{jk} v_j - R_{jk} R_{kj} y\|_2^2/2\} = \exp\{-\|v_k - y\|_2^2/2\}.
\]

Thus, we have translated to the $P_k$ measure.

As for the region of integration, first note that since $v_i = R_{ji} v_j$, it must be the case that $\|v_i\|_2^2 =
\|v_j\|_2^2$ for all $j, i \in [M]$. This means that the vector $b_j$ defining the acceptance region, which for
the MLE has coordinates $b_j(i) = \frac{1}{2}(\|v_j\|_2^2 - \|v_i\|_2^2)$, is just the all-zeros vector. The region of
integration $\{x \mid \Gamma_j x \ge 0\}$ is therefore, in the new variable:

\[
\{y \mid \Gamma_j R_{kj} y \ge 0\}.
\]

We must check that this polytope is exactly $A_k$, which means that we must check that for each $i$,
$(v_j - v_i)^T R_{kj}$ is a row of the $\Gamma_k$ matrix. But:

\[
(v_j - v_i)^T R_{kj} = v_j^T R_{jk}^T - v_i^T R_{jk}^T = v_k^T - v_i^T R_{jk}^T.
\]

Since $v_i$ can generate the family $\mathcal{V}$, it must be the case that $R_{jk} v_i \in \mathcal{V}$ so that this difference does
correspond to some row of $\Gamma_k$. Since we apply the same unitary operator to all of the rows, it must
be the case that the number of distinct rows is unchanged, or in other words, there is a bijection
from the rows in $\Gamma_j R_{kj}$ to the rows in $\Gamma_k$. Therefore, the transformed region of integration, after
the change of variable $x = R_{kj} y$, is exactly the acceptance region $A_k$, and the integrand is the $P_k$
measure. This means that $P_k[A_k] = P_j[A_j]$ and this is true for all pairs $(j, k)$, so that the risk
landscape is constant. By Proposition 5.3, this certifies optimality of the MLE.
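The row-bijection step can be illustrated on a family that is invariant under coordinate permutations. This sketch assumes a toy family of indicator vectors of 2-element subsets of 5 coordinates, with $R_{jk}$ taken to be a permutation; the helpers are hypothetical and only check that applying $R_{jk}$ to the rows of $\Gamma_j$ yields exactly the rows of $\Gamma_k$.

```python
from itertools import combinations

def indicator(subset, d):
    return tuple(1 if i in subset else 0 for i in range(d))

def permutation_mapping(src, dst, d):
    """A permutation of range(d), as a list, sending the set src onto the set dst."""
    perm = [None] * d
    for a, b in zip(sorted(src), sorted(dst)):
        perm[a] = b
    src_rest = [i for i in range(d) if i not in src]
    dst_rest = [i for i in range(d) if i not in dst]
    for a, b in zip(src_rest, dst_rest):
        perm[a] = b
    return perm

def apply(perm, x):
    """Apply the permutation matrix to x: coordinate a moves to position perm[a]."""
    out = [0] * len(x)
    for a, value in enumerate(x):
        out[perm[a]] = value
    return tuple(out)

def gamma_rows(j, hyps):
    """Rows v_j - v_i for i != j, as a set of tuples."""
    return {tuple(a - b for a, b in zip(hyps[j], v))
            for i, v in enumerate(hyps) if i != j}

d, k = 5, 2
sets = [set(c) for c in combinations(range(d), k)]
hyps = [indicator(s, d) for s in sets]
perm = permutation_mapping(sets[0], sets[7], d)  # plays the role of R_{jk} with j = 0, k = 7
mapped = {apply(perm, row) for row in gamma_rows(0, hyps)}
print(mapped == gamma_rows(7, hyps))
```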


5.5.5  Calculations  for  the  examples 


Calculations for $k$-Sets: We must upper and lower bound $W(\mathcal{V}, \alpha)$. First note that by symmetry,
every hypothesis achieves the maximum, so it suffices to compute just one of them:

\[
W(\mathcal{V}, \alpha) = \sum_{i \ne j} \exp(-\|v_i - v_j\|_2^2/\alpha) = \sum_{s=1}^k \binom{k}{s}\binom{d-k}{s} \exp(-2s\mu^2/\alpha).
\]

This follows by noting that the squared distance between two hypotheses is $\mu^2$ times the size of the
symmetric set difference between the two subsets, and then by a simple counting argument. Using well
known bounds on binomial coefficients, we obtain:

\begin{align*}
W(\mathcal{V}, \alpha) &\le \sum_{s=1}^k \exp(s \log(ke/s) + s \log((d-k)e/s) - 2s\mu^2/\alpha) \\
&= \sum_{s=1}^k \exp(s \log(e^2 k(d-k)/s^2) - 2s\mu^2/\alpha) \\
&\le k \exp(\log(e^2 k(d-k)) - 2\mu^2/\alpha) \quad \text{if } 2\mu^2/\alpha \ge \log(e^2 k(d-k)).
\end{align*}

This is smaller than $\delta$ whenever $\mu^2 \ge \alpha \log(ek(d-k)/\delta)$, which subsumes the requirement
above. For the lower bound:

\[
W(\mathcal{V}, \alpha) \ge \sum_{s=1}^k \exp(s \log(k/s) + s \log((d-k)/s) - 2s\mu^2/\alpha) \ge \exp(-2\mu^2/\alpha + \log(k(d-k))),
\]

which goes to infinity if $\mu^2 = o(\alpha \log(k(d-k)))$.
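The counting identity for $k$-sets is easy to confirm by enumeration on a small instance. The sketch below assumes each hypothesis is $\mu$ times the indicator of a $k$-subset of $[d]$; the function names are hypothetical.

```python
import math
from itertools import combinations

def w_ksets_bruteforce(d, k, mu, alpha):
    """Sum of exp(-||v_i - v_j||^2 / alpha) over all i != j for one fixed j."""
    subsets = list(combinations(range(d), k))
    reference = set(subsets[0])
    total = 0.0
    for s in subsets[1:]:
        sym_diff = len(reference.symmetric_difference(s))
        total += math.exp(-mu ** 2 * sym_diff / alpha)
    return total

def w_ksets_closed_form(d, k, mu, alpha):
    """sum_{s=1}^k C(k, s) C(d-k, s) exp(-2 s mu^2 / alpha)."""
    return sum(math.comb(k, s) * math.comb(d - k, s) * math.exp(-2 * s * mu ** 2 / alpha)
               for s in range(1, k + 1))

def w_ksets_upper_bound(d, k, mu, alpha):
    """k exp(log(e^2 k (d-k)) - 2 mu^2 / alpha); only claimed when 2 mu^2 / alpha >= log(e^2 k (d-k))."""
    return k * math.exp(math.log(math.e ** 2 * k * (d - k)) - 2 * mu ** 2 / alpha)

print(w_ksets_bruteforce(7, 3, 0.8, 2.0), w_ksets_closed_form(7, 3, 0.8, 2.0))
```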

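To certify that the uniform allocation strategy minimizes $W(\mathcal{V}, \alpha, B)$, we apply Proposition 5.6.
Fix $\tau$ and let $\bar{B}$ be such that $\bar{B}(i) = \tau/d$. By symmetry, every hypothesis achieves the maximum
under this allocation strategy, and we will take $\pi$ to be the uniform distribution over all
hypotheses.

For a hypothesis $j$ and a coordinate $i$, the subgradient $\frac{\partial f_j(B)}{\partial B(i)}$ at $\bar{B}$ depends on whether
$v_j(i) = 0$ or not. If $v_j(i) = 0$, then:

\[
\frac{\partial f_j(B)}{\partial B(i)}\Big|_{B = \bar{B}} = -\frac{\mu^2}{\alpha} \sum_{s=1}^k \binom{d-k-1}{s-1}\binom{k}{s} \exp\left(-\frac{2\tau\mu^2 s}{\alpha d}\right),
\]

and if $v_j(i) = \mu$ then:

\[
\frac{\partial f_j(B)}{\partial B(i)}\Big|_{B = \bar{B}} = -\frac{\mu^2}{\alpha} \sum_{s=1}^k \binom{k-1}{s-1}\binom{d-k}{s} \exp\left(-\frac{2\tau\mu^2 s}{\alpha d}\right).
\]

Both of these follow from straightforward counting arguments. Notice that the value of the
subgradient depends only on whether $v_j(i) = 0$ or not, and under the uniform distribution $\pi$,
$\mathbb{E}_{j \sim \pi} v_j(i) = \mathbb{E}_{j \sim \pi} v_j(i')$ for all coordinates $i, i'$. This implies that the constant vector is in the
subgradient of $f(B)$ at $\bar{B}$, so that $\bar{B}$ is the minimizer of $W(\mathcal{V}, \alpha, B)$ subject to $\|B\|_1 \le \tau$.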
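The certificate can be verified numerically on a small $k$-sets instance. The sketch below is illustrative only: it assumes $f_j(B) = \sum_{i \ne j} \exp(-\|v_i - v_j\|_B^2/\alpha)$ with $\|x\|_B^2 = \sum_i B(i) x(i)^2$, and the closed forms in `counting_formula` are re-derived here by counting (an assumption of this sketch, not a quotation). It checks that the gradient averaged over a uniform prior is a constant vector at the uniform allocation.

```python
import math
from itertools import combinations

def ksets_gradients(d, k, mu, alpha, tau):
    """Gradient of each f_j at the uniform allocation B(i) = tau / d, by brute force."""
    subsets = [frozenset(c) for c in combinations(range(d), k)]
    weight = tau / d
    grads = {}
    for sj in subsets:
        grad = [0.0] * d
        for sk in subsets:
            if sk == sj:
                continue
            diff = sj ^ sk  # coordinates where the two hypotheses differ
            term = math.exp(-mu ** 2 * weight * len(diff) / alpha)
            for i in diff:
                grad[i] -= (mu ** 2 / alpha) * term
        grads[sj] = grad
    return grads

def expected_gradient(grads, d):
    """Average gradient under the uniform distribution over hypotheses."""
    return [sum(g[i] for g in grads.values()) / len(grads) for i in range(d)]

def counting_formula(d, k, mu, alpha, tau, on_support):
    """Closed form for one coordinate, on or off the support of the hypothesis."""
    total = 0.0
    for s in range(1, k + 1):
        if on_support:
            count = math.comb(k - 1, s - 1) * math.comb(d - k, s)
        else:
            count = math.comb(d - k - 1, s - 1) * math.comb(k, s)
        total += count * math.exp(-2 * tau * mu ** 2 * s / (alpha * d))
    return -(mu ** 2 / alpha) * total

grads = ksets_gradients(6, 2, 0.7, 1.5, 4.0)
avg = expected_gradient(grads, 6)
print(max(avg) - min(avg))  # numerically zero: the expected gradient is constant
```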

We have already done the requisite calculation to bound the minimax risk under sampling.
The calculations above show that if $\mu = \omega\left(\sqrt{\frac{d}{\tau} \log(k(d-k))}\right)$ then the maximum likelihood
estimator, when using the uniform sampling strategy, has risk tending to zero. Conversely, if
$\mu = o\left(\sqrt{\frac{d}{\tau} \log(k(d-k))}\right)$ then the minimax risk, for any allocation strategy, tends to one.

Calculation for Biclusters: Due to symmetry, all hypotheses achieve the maximum and therefore,
we can directly calculate $W(\mathcal{V}, \alpha)$. We use the notation $C_n^s$ to denote the binomial coefficient
$\binom{n}{s}$.

\begin{align*}
W(\mathcal{V}, \alpha) = &\sum_{s_r=1}^k \sum_{s_c=1}^k C_k^{s_r} C_{d-k}^{s_r} C_k^{s_c} C_{d-k}^{s_c} \exp\left(-\frac{2\mu^2}{\alpha}\left(s_r(k - s_c) + s_c(k - s_r) + s_r s_c\right)\right) \\
&+ \sum_{s_r=1}^k C_k^{s_r} C_{d-k}^{s_r} \exp\left(-\frac{2\mu^2}{\alpha}(s_r k)\right) + \sum_{s_c=1}^k C_k^{s_c} C_{d-k}^{s_c} \exp\left(-\frac{2\mu^2}{\alpha}(s_c k)\right).
\end{align*}

The last two terms come from the cases where $s_c = 0$ or $s_r = 0$, which are all of the hypotheses
that share the same columns but disagree on the rows (or share the same rows but disagree on the
columns). Using binomial approximations, the first term can be upper bounded by:


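\begin{align*}
&\le \sum_{s_r=1}^k \sum_{s_c=1}^k \exp\left(s_r \log\frac{k(d-k)e^2}{s_r^2} + s_c \log\frac{k(d-k)e^2}{s_c^2} - \frac{2\mu^2}{\alpha}\left(s_r(k - s_c/2) + s_c(k - s_r/2)\right)\right) \\
&\le \sum_{s_r=1}^k \exp\left(s_r\left(\log\frac{k(d-k)e^2}{s_r^2} - \frac{k\mu^2}{\alpha}\right)\right) \sum_{s_c=1}^k \exp\left(s_c\left(\log\frac{k(d-k)e^2}{s_c^2} - \frac{k\mu^2}{\alpha}\right)\right).
\end{align*}

The two terms here are identical, so we will just bound the first one:

\begin{align*}
\sum_{s_r=1}^k \exp\left(s_r\left(\log\frac{k(d-k)e^2}{s_r^2} - \frac{k\mu^2}{\alpha}\right)\right) &\le \sum_{s_r=1}^k \exp\left(s_r\left(\log(k(d-k)e^2) - k\mu^2/\alpha\right)\right) \\
&\le k \exp\left(\log(k(d-k)e^2) - k\mu^2/\alpha\right) \quad \text{if } \mu^2 \ge \frac{\alpha}{k} \log(k(d-k)e^2).
\end{align*}

Applying this inequality to both terms gives a bound on $W(\mathcal{V}, \alpha)$. This bound is smaller than $\delta$
as long as $\mu \ge \sqrt{\frac{c\alpha}{k} \log(k(d-k)e/\delta)}$ for some universal constant $c$. Again this subsumes the
condition required for the inequality to hold.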


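The other two terms are essentially the same. Using binomial approximations, both expressions
can be bounded as:

\begin{align*}
\sum_{s_r=1}^k C_k^{s_r} C_{d-k}^{s_r} \exp\left(-\frac{2\mu^2}{\alpha}(s_r k)\right) &\le \sum_{s_r=1}^k \exp\left(s_r \log(e^2 k(d-k)/s_r^2) - 2 s_r k\mu^2/\alpha\right) \\
&\le k \exp\left(\log(k(d-k)e^2) - 2k\mu^2/\alpha\right) \quad \text{if } \mu^2 \ge \frac{\alpha}{2k} \log(k(d-k)e^2).
\end{align*}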

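These bounds lead to the same minimax rate as above.

For the lower bound, we again use binomial approximations.

\begin{align*}
W(\mathcal{V}, \alpha) &\ge \sum_{s_r=1}^k \sum_{s_c=1}^k \exp\left(s_r \log\frac{k(d-k)}{s_r^2} + s_c \log\frac{k(d-k)}{s_c^2} - \frac{2\mu^2}{\alpha}\left(s_r(k - s_c/2) + s_c(k - s_r/2)\right)\right) \\
&\ge \sum_{s_r=1}^k \exp\left(s_r\left(\log\frac{k(d-k)}{s_r^2} - \frac{2k\mu^2}{\alpha}\right)\right) \sum_{s_c=1}^k \exp\left(s_c\left(\log\frac{k(d-k)}{s_c^2} - \frac{2k\mu^2}{\alpha}\right)\right) \\
&\ge \exp\left(\log(k(d-k)) - 2\mu^2 k/\alpha\right).
\end{align*}

This lower bound goes to infinity if $\mu = o\left(\sqrt{\frac{\alpha}{k} \log(k(d-k))}\right)$, which lower bounds the minimax rate.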
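The exact expression for the biclusters SEDF, including the two boundary sums, can be confirmed by enumeration on a tiny instance. This sketch assumes each hypothesis is $\mu$ on a $k \times k$ block of a $d \times d$ matrix and zero elsewhere; the function names are hypothetical.

```python
import math
from itertools import combinations

def w_biclusters_bruteforce(d, k, mu, alpha):
    """Sum of exp(-||v - v_0||^2 / alpha) over all hypotheses v other than a fixed v_0."""
    sets = [frozenset(c) for c in combinations(range(d), k)]
    rows0, cols0 = sets[0], sets[0]
    total = 0.0
    for rows in sets:
        for cols in sets:
            if rows == rows0 and cols == cols0:
                continue
            # number of matrix entries on which the two hypotheses differ
            differing = sum((a in rows0 and b in cols0) != (a in rows and b in cols)
                            for a in range(d) for b in range(d))
            total += math.exp(-mu ** 2 * differing / alpha)
    return total

def w_biclusters_closed_form(d, k, mu, alpha):
    """Double sum over s_r, s_c in 0..k (not both zero); s_r = 0 or s_c = 0 gives the boundary sums."""
    def count(s):
        return math.comb(k, s) * math.comb(d - k, s)
    total = 0.0
    for sr in range(k + 1):
        for sc in range(k + 1):
            if sr == 0 and sc == 0:
                continue
            entries = sr * (k - sc) + sc * (k - sr) + sr * sc
            total += count(sr) * count(sc) * math.exp(-2 * mu ** 2 * entries / alpha)
    return total

print(w_biclusters_bruteforce(4, 2, 0.6, 1.3), w_biclusters_closed_form(4, 2, 0.6, 1.3))
```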


To certify that the uniform allocation strategy minimizes $W(\mathcal{V}, \alpha, B)$, we apply Proposition 5.6.
Fix $\tau$ and let $\bar{B}$ be such that $\bar{B}((a, b)) = \tau/d^2$ for all $(a, b) \in [d] \times [d]$. By symmetry, every
hypothesis achieves the maximum under this allocation strategy, and we will take $\pi$ to be the
uniform distribution over all hypotheses.


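For a hypothesis $j$, let $f_j(B)$ denote the term in the SEDF centered around $j$. For a hypothesis
$j$ based on clusters $S_l, S_r$ and a coordinate $(a, b)$, the subgradient at $\bar{B}(a, b)$ depends on
whether $a \in S_l$ and $b \in S_r$. If $a \notin S_l$ and $b \notin S_r$, then:

\[
\frac{\partial f_j(B)}{\partial B(a, b)}\Big|_{B = \bar{B}} = -\frac{\mu^2}{\alpha} \sum_{s_r=1}^k \sum_{s_c=1}^k C_{d-k-1}^{s_r-1} C_k^{s_r} C_{d-k-1}^{s_c-1} C_k^{s_c} \exp\left(\frac{-2\tau\mu^2}{\alpha d^2}\left(s_r(k - s_c/2) + s_c(k - s_r/2)\right)\right).
\]

This follows by direct calculation. Similar calculations yield the other cases:

\begin{align*}
\frac{\partial f_j(B)}{\partial B(a, b)}\Big|_{B = \bar{B}} &= -\frac{\mu^2}{\alpha} \sum_{s_r=0}^{k-1} \sum_{s_c=1}^{k} C_{k-1}^{s_r} C_{d-k}^{s_r} C_{d-k-1}^{s_c-1} C_k^{s_c} \exp\left(\frac{-2\tau\mu^2}{\alpha d^2}\left(s_r(k - s_c) + s_c(k - s_r) + s_r s_c\right)\right), \\
\frac{\partial f_j(B)}{\partial B(a, b)}\Big|_{B = \bar{B}} &= -\frac{\mu^2}{\alpha} \sum_{s_r=1}^{k} \sum_{s_c=0}^{k-1} C_{d-k-1}^{s_r-1} C_k^{s_r} C_{k-1}^{s_c} C_{d-k}^{s_c} \exp\left(\frac{-2\tau\mu^2}{\alpha d^2}\left(s_r(k - s_c) + s_c(k - s_r) + s_r s_c\right)\right), \\
\frac{\partial f_j(B)}{\partial B(a, b)}\Big|_{B = \bar{B}} &= -\frac{\mu^2}{\alpha} \sum_{s_r=0}^{k} \sum_{s_c=0}^{k} \left(C_k^{s_r} C_k^{s_c} - C_{k-1}^{s_r} C_{k-1}^{s_c}\right) C_{d-k}^{s_r} C_{d-k}^{s_c} \exp\left(\frac{-2\tau\mu^2}{\alpha d^2}\left(s_r(k - s_c) + s_c(k - s_r) + s_r s_c\right)\right).
\end{align*}

These correspond to the cases $a \in S_l, b \notin S_r$; $a \notin S_l, b \in S_r$; and the case where $a \in S_l, b \in S_r$,
respectively. The main point is that the value of the subgradient depends only on presence or
absence of the row/column in the cluster, and under the uniform distribution $\pi$, each row/column
is equally likely to be in the cluster. This means that for every coordinate $(a, b)$, taking the
expected subgradient with respect to the uniform distribution over hypotheses yields the same
expression. So the constant vector is in the subgradient of $f(B)$ at $\bar{B}$, so that $\bar{B}$ is the minimizer
of $W(\mathcal{V}, \alpha, B)$ subject to $\|B\|_1 \le \tau$.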


We have already done the requisite calculation to bound the minimax risk under sampling. The
calculations above show that if $\mu = \omega\left(\sqrt{\frac{d^2}{k\tau} \log(k(d-k))}\right)$ then the maximum likelihood
estimator, when using the uniform sampling strategy, has risk tending to zero. Conversely, if
$\mu = o\left(\sqrt{\frac{d^2}{k\tau} \log(k(d-k))}\right)$ then the minimax risk, for any allocation strategy, tends to one.

The biclusters family is clearly unitarily invariant with respect to the set of orthonormal matrices
that permute the rows and columns independently. The family is easiest to describe as acting
on the matrices $\mathbf{1}_{S_l} \mathbf{1}_{S_r}^T$. Let $P_l, P_r$ be any two $d \times d$ permutation matrices. Then the matrix
$P_l \mathbf{1}_{S_l} (P_r \mathbf{1}_{S_r})^T$ is clearly another hypothesis, and as we vary $P_l$ and $P_r$ we generate all of the
hypotheses. Note that these permutations are unitary operators on the matrix space $\mathbb{R}^{d \times d}$, which
allows us to apply Theorem 5.4.

For the analysis of the interactive algorithm, let us first bound the probability that the algorithm
makes a mistake on any single coordinate. Consider sampling a coordinate $x$ with mean $\mu$ and
noise variance $1/b$. A Gaussian tail bound reveals that:

\[
P[|x - \mu| \ge \epsilon] \le 2\exp(-b\epsilon^2/2).
\]

We will sample no more than $d^2$ coordinates and we will sample each coordinate with the same
amount of energy $b$. So by the union bound, the probability that we make a single mistake in
classifying a coordinate that we query is bounded by $\delta/2$ as long as:

\[
\mu \ge 2\epsilon = 2\sqrt{\frac{2}{b} \log(4d^2/\delta)}.
\]

We now need to bound $b$, which depends on the total number of coordinates queried by the
algorithm. In the first phase of the algorithm, we sample coordinates uniformly at random until
we hit one that is active. Since each sample hits an active coordinate with probability $k^2/d^2$:

\[
P[\text{hit active coordinate in } T \text{ samples}] = 1 - (1 - k^2/d^2)^T \ge 1 - \exp(-Tk^2/d^2),
\]

or if $T = \frac{d^2}{k^2} \log(2/\delta)$, the probability that we hit an active coordinate in $T$ samples will be at least
$1 - \delta/2$. The total number of samples we use then can be upper bounded by $2d + \frac{d^2}{k^2} \log(2/\delta)$,
which means that we can allocate our budget $\tau$ evenly over these coordinates. Therefore we can
set $b = \tau\left(2d + \frac{d^2}{k^2} \log(2/\delta)\right)^{-1}$, and plugging into the condition on $\mu$ above proves the result.
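The budget accounting in this argument is simple enough to tabulate. The sketch below uses hypothetical function names and assumes the first phase uses $T = \frac{d^2}{k^2}\log(2/\delta)$ uniform samples (rounded up) and that the total sample count is at most $2d + \frac{d^2}{k^2}\log(2/\delta)$.

```python
import math

def first_phase_samples(d, k, delta):
    """Number of uniform samples T = (d^2 / k^2) log(2 / delta), rounded up."""
    return math.ceil((d ** 2 / k ** 2) * math.log(2 / delta))

def hit_probability(d, k, T):
    """Probability that at least one of T uniform samples lands in the k x k active block."""
    return 1 - (1 - k ** 2 / d ** 2) ** T

def per_coordinate_energy(tau, d, k, delta):
    """Energy b when the budget tau is spread evenly over at most 2d + (d^2 / k^2) log(2 / delta) samples."""
    return tau / (2 * d + (d ** 2 / k ** 2) * math.log(2 / delta))

d, k, delta, tau = 100, 5, 0.1, 1000.0
T = first_phase_samples(d, k, delta)
print(T, hit_probability(d, k, T), per_coordinate_energy(tau, d, k, delta))
```

When $k$ is not too small relative to $\sqrt{d}$, the $2d$ term dominates the sample count, so the per-coordinate energy is on the order of $\tau/d$ rather than the $\tau/d^2$ available to the uniform strategy.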

Calculation for Stars: For the stars problem, define $Nb(j) \subset V$ to be the neighbors of the vertex
$j$ in the graph. For a fixed hypothesis $j$, we have

\begin{align*}
W_j(\mathcal{V}, \alpha) &= \sum_{k \ne j} \exp(-\|v_k - v_j\|_2^2/\alpha) \\
&= \sum_{k \in Nb(j)} \exp(-\mu^2(\deg(k) + \deg(j) - 2)/\alpha) + \sum_{k \notin Nb(j)} \exp(-\mu^2(\deg(k) + \deg(j))/\alpha) \\
&\le \exp\left(-\mu^2 \deg_{\min}/\alpha - \mu^2 \deg(j)/\alpha\right)\left(\deg(j) \exp(2\mu^2/\alpha) + |V| - \deg(j)\right).
\end{align*}

This last inequality follows by replacing every $\deg(k)$ with $\deg_{\min}$, the lower bound on the
degrees. This last expression is maximized with $\deg(j) = \deg_{\min}$, which can be observed by
noticing that the derivative with respect to $\deg(j)$ is negative. This gives the bound:

\[
W(\mathcal{V}, \alpha) \le \exp\left(-2\mu^2 \deg_{\min}/\alpha\right)\left(\deg_{\min} \exp(2\mu^2/\alpha) + |V| - \deg_{\min}\right).
\]

One can lower bound $W(\mathcal{V}, \alpha)$ by choosing the hypothesis $j$ with $\deg(j) = \deg_{\min}$ and then
replacing all other degree terms with $\deg_{\max}$ in the above calculations. This gives:

\[
W(\mathcal{V}, \alpha) \ge \exp\left(-\frac{\mu^2}{\alpha}(\deg_{\min} + \deg_{\max})\right)\left(\deg_{\min} e^{2\mu^2/\alpha} + |V| - \deg_{\min}\right).
\]
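The degree-based expression for $W_j$ can be checked against a direct computation on a small graph. This sketch assumes the star hypothesis for vertex $j$ is the vector indexed by edges that equals $\mu$ on every edge incident to $j$ and zero elsewhere (an assumed model consistent with the distances used above); the graph and helper names are illustrative.

```python
import math

def star_vectors(num_vertices, edges, mu):
    """One hypothesis per vertex: mu on each incident edge, 0 on the others (assumed model)."""
    return [[mu if v in e else 0.0 for e in edges] for v in range(num_vertices)]

def w_j_bruteforce(j, vectors, alpha):
    """Sum over k != j of exp(-||v_k - v_j||^2 / alpha), computed from the vectors."""
    total = 0.0
    for k, vk in enumerate(vectors):
        if k != j:
            sq = sum((a - b) ** 2 for a, b in zip(vk, vectors[j]))
            total += math.exp(-sq / alpha)
    return total

def w_j_degree_formula(j, num_vertices, edges, mu, alpha):
    """Same quantity from degrees: neighbours share one edge, so two differing entries are saved."""
    deg = [sum(1 for e in edges if v in e) for v in range(num_vertices)]
    neighbors = {u if w == j else w for (u, w) in edges if j in (u, w)}
    total = 0.0
    for k in range(num_vertices):
        if k == j:
            continue
        shared = 2 if k in neighbors else 0
        total += math.exp(-mu ** 2 * (deg[k] + deg[j] - shared) / alpha)
    return total

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]  # a 4-cycle with one chord
vectors = star_vectors(4, edges, 0.9)
print([round(w_j_bruteforce(j, vectors, 2.0), 6) for j in range(4)])
```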

Calculation for Random Codes: In the proof of Theorem 5.1, we saw that for a hypothesis $j$,
we can bound the probability of error by:

\[
P_j[\text{error}] \le \sum_{k \ne j} \exp(-\|v_j - v_k\|_2^2/8).
\]

This means that:

\begin{align*}
\mathbb{E}_{\mathcal{V}} \mathbb{E}_{j \sim \mathrm{Unif}([M])} P_j[\text{error}] &\le \mathbb{E}_{\mathcal{V}} \frac{1}{M} \sum_{j=1}^M \sum_{k \ne j} \exp(-\|v_j - v_k\|_2^2/8) \\
&= (M - 1)\, \mathbb{E}_{v, v'} \exp(-\|v - v'\|_2^2/8) = (M - 1) \prod_{j=1}^d \mathbb{E}_{x \sim \chi_1^2} \exp(-Px/4) \\
&= (M - 1)(1 + P/2)^{-d/2}.
\end{align*}

Notice that the only inequality in this sequence is the first one, which is essentially an application
of Theorem 5.1. The last equality is based on the moment-generating function of a $\chi_1^2$ random
variable.
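The moment-generating-function step can be confirmed numerically. The sketch below writes $x = z^2$ with $z$ standard normal and integrates $\exp(-Pz^2/4)$ against the Gaussian density with a plain trapezoid rule; the step count, the truncation at $|z| \le 12$, and the function names are arbitrary choices made for illustration.

```python
import math

def chi2_mgf_numeric(P, n_steps=20000, z_max=12.0):
    """E[exp(-P x / 4)] for x ~ chi^2_1, computed as an integral over z ~ N(0, 1) with x = z^2."""
    h = 2 * z_max / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        z = -z_max + i * h
        weight = 0.5 if i in (0, n_steps) else 1.0  # trapezoid end-point weights
        total += weight * math.exp(-0.5 * z * z - P * z * z / 4.0)
    return total * h / math.sqrt(2 * math.pi)

def chi2_mgf_closed_form(P):
    """(1 + P/2)^(-1/2), the chi^2_1 moment-generating function evaluated at -P/4."""
    return (1 + P / 2) ** -0.5

def union_bound(M, P, d):
    """(M - 1)(1 + P/2)^(-d/2), the averaged error bound for M random codewords in d dimensions."""
    return (M - 1) * (1 + P / 2) ** (-d / 2)

print(chi2_mgf_numeric(3.0), chi2_mgf_closed_form(3.0))
```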

To achieve the bound on the rate of the code, set this final expression to be at most some $f(d)$
which is $o(1)$. Then the probability of error is at most $f(d) \to 0$ and the rate $R$ is:

\[
R = \frac{\log M}{d} = \frac{1}{2} \log(1 + P/2) + \frac{\log f(d)}{d} = \frac{1}{2} \log(1 + P/2) - \omega(1/d).
\]
Chapter 6


Conclusions 


In this thesis we studied interactive and non-interactive algorithms for several unsupervised learning
problems with a focus on understanding the advantages in statistics and computation enabled
by the interactive paradigm. Demonstration of the statistical advantage of interactive learning
requires both new algorithmic ideas and new technology to establish fundamental limits on learning
paradigms. In this thesis we made progress in both directions; we developed several new interactive
learning algorithms and also showed strong limits on non-interactive algorithms. Combined,
these sets of results make a compelling statistical case for interactive learning.

In the examples considered here, we saw how uniformity governed the level of statistical
improvement offered by interactive learning. In problem instances with high degrees of
non-uniformity, which was measured differently in each problem, we saw that interactive approaches
are significantly stronger than non-interactive ones. While there is at present no unifying
theory capturing this effect, we believe that the examples considered here lend evidence to the
importance of non-uniformity for interactive learning.

Regarding computation, many of the interactive algorithms developed here are faster than existing
non-interactive ones. We made claims about the computational advantage of interactivity in
a non-rigorous way, as establishing running-time lower bounds is often quite challenging.
Nevertheless, we find this claim to be quite surprising, as it is not, at present, demonstrated in the
literature on interactive supervised learning.

While  there  is  still  much  to  be  explored  regarding  interactivity  in  machine  learning,  we  hope  the 
results  in  this  thesis  have  made  a  compelling  case  for  this  paradigm.  We  look  forward  to  future 
advances in this direction and a deeper understanding of interactive learning.
Appendix A


Concentration  Inequalities 


Here  we  collect  a  number  of  well-known  large  deviation  bounds  used  throughout  the  thesis. 
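Proposition A.1 (Scalar Bernstein). Let $X_1, \ldots, X_n$ be independent, centered scalar random
variables with $\sigma^2 = \sum_{i=1}^n \mathbb{E}[X_i^2]$ and $R = \max_i |X_i|$. Then:

\[
P\left(\sum_{i=1}^n X_i \ge t\right) \le \exp\left(\frac{-t^2/2}{\sigma^2 + Rt/3}\right). \tag{A.1}
\]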


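Proposition A.2 (Vector Bernstein [91]). Let $X_1, \ldots, X_n$ be independent centered random vectors
with $\sum_{i=1}^n \mathbb{E}\|X_i\|_2^2 \le V$. Then for any $t \le V(\max_i \|X_i\|_2)^{-1}$:

\[
P\left(\left\|\sum_{i=1}^n X_i\right\|_2 \ge \sqrt{V} + t\right) \le \exp\left(\frac{-t^2}{4V}\right). \tag{A.2}
\]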


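Proposition A.3 (Matrix Bernstein [165]). Let $X_1, \ldots, X_n$ be independent, random, self-adjoint
matrices with dimension $d$ satisfying:

\[
\mathbb{E} X_k = 0 \quad \text{and} \quad \|X_k\|_2 \le R \quad \text{almost surely}.
\]

Then, for all $t \ge 0$,

\[
P\left(\lambda_{\max}\left(\sum_{k=1}^n X_k\right) \ge t\right) \le d \exp\left(\frac{-t^2/2}{\sigma^2 + Rt/3}\right) \quad \text{where} \quad \sigma^2 = \left\|\sum_{k=1}^n \mathbb{E} X_k^2\right\|_2.
\]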


Proposition A.4 (Rectangular Matrix Bernstein [165]). Let $X_1, \ldots, X_n$ be independent random
matrices with dimension $d_1 \times d_2$ satisfying:

\[
\mathbb{E} X_k = 0 \quad \text{and} \quad \|X_k\|_2 \le R \quad \text{almost surely}.
\]